Search through a text file in between two specified characters - python

I want to search a specified section of the attached text file based on the '#' character. Basically, I want to look at all the data starting at a line that begins with '#' and ending at the next line that begins with '#'. Within each of these sections, I'd also like to look for a specified string. I am coding in Python.
TEXT FILE (the event header lines have a '#' symbol in front of the number, e.g. the real file reads '#1 Women 1000 Yd Free')
FAST - CA
HY-TEK's MEET MANAGER 7.0 - 11:32 PM 2/11/2018 Page 1
2018 Presidents' Day Senior Swimming Classic
Psych Sheet
#1 Women 1000 Yd Free
10:39.39 SECT
Name
Age
Team
1 Zamora Gallegos, Mariana
15 BC
2 Arzave, Juli
16
SBA-SI
Seed Time
1:00.21 SECT
10:02.58 SECT
3 Aguilar Ortega, Martha
18 Ruth
BC
10:05.72 SECT
4 Nowaski, Danielle 17
10:13.74 SECT
CAST-SI
5 Miranda Aguilar, Danitza
17 BC
10:23.30 SECT
6 Moreno Osuna, Ashely
16 Dariela
BC
10:23.70 SECT
7 Motekaitis, Mia
17
UCD-SN
10:24.96 SECT
8 Gardner, Amber
15
CROW-PC
10:27.86 SECT
9 Urias Quijas, Sophie14
AprilBC
53 Macias Ruiz, Alejandra
17 BC
11:23.81
45 Suehiro, Alex
54 Fuller, Monica
SMSC-CA
11:29.23
14 BC
46 Gallegos Portugal, Hanson
10:26.01
MP-PC
11:30.00
47 O'Connell, Daniel 17
CROW-PC
10:27.13
13 BC
56 Mendoza Camilo, Arely
11:30.58
48 Monge, Colin
PS-SI
10:27.63
57 DePaco, Lexi
15
CROW-PC
11:31.13
17 BC
49 Mejia Matamoros, Miguel
10:28.50
58 Morgan, Chloe
12
MRA-SI
11:57.36
50 Cebreros Gracia, Jorge
16
BC
10:30.06
59 Ferguson, Lizzie
18
MP-PC
12:12.40
51 Simpson, Aidan
ICAC-SI
10:30.20
60 Vera, Daniela
19
BERIM
52 Cordova Medina, Ivar
17
BC
10:30.57
53 Rascon, Esteban
SBA-SI
10:33.35
54 Mota Ezpinoza, Leonel
13
BC
10:34.09
55 Marsalek, Asa
16
SMSC-CA
10:34.14
56 Shitole, Viraj
17
DACA-PC
10:39.53
57 Friedrich, Aaron
15
SMSC-CA
10:40.67
58 Sokalzuk, Samuel 14
MP-PC
10:40.89
59 Jin, Lei
ICAC-SI
10:56.89
14
55 Duro, Dominique 18
9:12.23L SECT
#2 Men 1000 Yd Free

Begin by using string.find("#"); it will return -1 if the '#' character doesn't exist in the line. Once you find it, make an outer loop that iterates until you find the next '#'. When you do find a '#', examine the next 4-5 elements: for example, check whether the header says "Women" or "Men" and whether it contains a number for the distance. For the distance, Python's str.isdigit() method is perfect.
Inside that loop, compare against your list of names and confirm when you have a match; for example, use the == operator to check whether the name matches. Once you do find the name, check the next 4 pieces of information and copy those down as strings.
Given about 30 names and a list of about 3,000 swimmers, that is close to 100,000 comparisons, so the program could take a while to run.
Outer loop (checks for '#'):
if '#' found --> check the event, gender, etc., and reset the global event variable so that the swimmers below it can be saved to that event
else --> do the name comparisons
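A minimal sketch of that approach is below. The file name, the sample names, and the choice of substring matching are assumptions for illustration, not part of the original question:

def find_swimmers(path, names):
    """Walk the psych sheet, track the current '#' event header, and
    record which event each matching name appears under."""
    names = set(names)                      # set membership checks stay cheap, even for ~3000 names
    with open(path, encoding="utf-8") as f:
        lines = [ln.strip() for ln in f]
    current_event = None                    # the "global variable" from the pseudocode above
    matches = []
    for i, line in enumerate(lines):
        if line.startswith("#"):            # a new event section starts here
            current_event = line            # e.g. "#1 Women 1000 Yd Free"
        elif any(name in line for name in names):
            details = lines[i + 1:i + 5]    # the next 4 pieces of information (age, team, seed time, ...)
            matches.append((current_event, line, details))
    return matches

# Hypothetical usage:
# print(find_swimmers("psych_sheet.txt", ["Motekaitis, Mia", "Gardner, Amber"]))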

Related

How to parse HTML table that is inside div and not table in Python

I am trying to parse the table from this website. I started with just the Username column and with the help I got on stackoverflow, I was able to get the content of Username with the following code:
with open("Top 50 TikTok users sorted by Followers - Socialblade TikTok Stats _ TikTok Statistics.html", "r", encoding="utf-8") as file:
soup = BeautifulSoup(str(file.readlines()), "html.parser")
tiktok = []
for tag in soup.select("div div:nth-of-type(n+5) > div > a"):
tiktok.append(tag.text)
which gives me
['addison rae',
'Bella Poarch',
'Zach King',
'TikTok',
'Spencer X',
'Will Smith',
'Loren Gray',
'dixie',
'Michael Le',
'Jason Derulo',
'Riyaz',
...
My ultimate goal is to populate the entire table with [Rank, Grade, Username, Uploads, Followers, Following, Likes]
I have read a few articles on Parsing HTML Tables in Python with BeautifulSoup and pandas, but it didn't work since this is not defined as a table in the source. What are some alternatives to get this as a table in Python?
You can use this code to load the HTML from the file into soup and then parse the table into a DataFrame:
import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("page.html", "r").read(), "html.parser")

data = []
for div in soup.select('div[style*="fafafa"], div[style*="f8f8f8"]'):
    data.append(
        [
            d.get_text(strip=True)
            for d in div.find_all("div", recursive=False)[:8]
        ]
    )

df = pd.DataFrame(
    data,
    columns=[
        "Rank",
        "Grade",
        "Username",
        "Uploads",
        "Followers",
        "Following",
        "Likes",
        "Interactions",
    ],
)
print(df)
df.to_csv("data.csv", index=False)
Prints:
Rank Grade Username Uploads Followers Following Likes Interactions
0 1st A++ charli d'amelio 1,755 113,600,000 1,210 9,200,000,000 --
1 2nd A++ addison rae 1,411 79,900,000 2,454 5,100,000,000 --
2 3rd A++ Bella Poarch 282 63,600,000 588 1,400,000,000 --
3 4th A++ Zach King 277 58,800,000 41 723,400,000 --
4 5th A++ TikTok 139 52,900,000 495 250,300,000 91
5 6th A++ Spencer X 1,250 52,700,000 7,206 1,300,000,000 --
6 7th A++ Will Smith 73 52,500,000 23 314,400,000 --
7 8th A++ Loren Gray 2,805 52,100,000 221 2,800,000,000 --
8 9th A++ dixie 120 51,200,000 1,267 2,900,000,000 --
9 10th A++ Michael Le 1,158 47,400,000 93 1,300,000,000 --
10 11th A+ Jason Derulo 675 44,900,000 12 1,000,000,000 --
11 12th A+ Riyaz 2,056 44,100,000 43 2,100,000,000 --
12 13th A+ Kimberly Loaiza ✨ 1,150 41,000,000 123 2,200,000,000 --
13 14th A+ Brent Rivera 955 37,800,000 272 1,200,000,000 --
14 15th A+ cznburak 1,301 37,300,000 1 688,700,000 --
15 16th A+ The Rock 42 36,200,000 1 200,300,000 --
16 17th A+ James Charles 238 36,200,000 148 881,400,000 --
17 18th A+ BabyAriel 2,365 35,300,000 326 1,900,000,000 --
18 19th A+ JoJo Siwa 1,206 33,500,000 346 1,100,000,000 --
19 20th A+ avani 5,347 33,300,000 5,003 2,400,000,000 --
20 21st A+ GIL CROES 693 32,900,000 454 803,200,000 --
21 22nd A+ Faisal shaikh 461 32,200,000 -- 2,000,000,000 --
22 23rd A+ BTS 39 32,000,000 -- 557,100,000 255
23 24th A+ LILHUDDY 4,187 30,500,000 8,652 1,600,000,000 --
24 25th A+ Stokes Twins 548 30,100,000 21 781,000,000 --
25 26th A+ Joe 1,487 29,800,000 8,402 1,200,000,000 --
26 27th A+ ROD🥴 1,792 29,500,000 536 1,700,000,000 --
27 28th A+ 𝙳𝚘𝚖𝚒𝚗𝚒𝚔 899 29,400,000 216 1,700,000,000 --
28 29th A+ Kylie Jenner 69 29,400,000 14 318,800,000 --
29 30th A+ Junya/じゅんや 2,823 29,000,000 1,934 533,800,000 12,200
30 31st A+ YZ 816 28,900,000 563 554,700,000 --
31 32nd A+ Arishfa Khan 2,026 28,600,000 27 1,100,000,000 --
32 33rd A+ Lucas and Marcus 1,248 28,500,000 158 806,500,000 --
33 34th A+ jannat_zubair29 1,054 28,200,000 6 746,300,000 47
34 35th A+ Nisha Guragain 1,751 28,000,000 33 756,300,000 --
35 36th A+ Selena Gomez 40 27,800,000 17 82,300,000 --
36 37th A+ Kris HC 1,049 27,800,000 1,405 1,200,000,000 --
37 38th A+ flighthouse 4,200 27,600,000 488 2,300,000,000 --
38 39th A+ wigofellas 1,251 27,500,000 812 707,200,000 --
39 40th A+ Savannah LaBrant 1,860 27,300,000 155 1,400,000,000 --
40 41st A+ noah beck 1,395 26,900,000 2,297 1,700,000,000 --
41 42nd A+ Liza Koshy 155 26,700,000 104 321,900,000 --
42 43rd A+ Kirya Kolesnikov 1,338 26,400,000 78 543,200,000 --
43 44th A+ Awez Darbar 2,708 26,100,000 208 1,100,000,000 --
44 45th A+ Carlos Feria 2,522 25,700,000 138 1,200,000,000 --
45 46th A+ Kira Kosarin 837 25,700,000 401 447,000,000 --
46 47th A+ Naim Darrechi🏆 2,634 25,300,000 527 2,200,000,000 --
47 48th A+ Josh Richards 1,899 24,900,000 9,847 1,600,000,000 --
48 49th A+ Q Park 231 24,800,000 3 294,100,000 --
49 50th A+ TikTok_India 186 24,500,000 191 40,100,000 --
And saves data.csv.
EDIT: To get URL username:
import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("page.html", "r").read(), "html.parser")

data = []
for div in soup.select('div[style*="fafafa"], div[style*="f8f8f8"]'):
    data.append(
        [
            d.get_text(strip=True)
            for d in div.find_all("div", recursive=False)[:8]
        ]
        + [div.a["href"].split("/")[-1]]
    )

df = pd.DataFrame(
    data,
    columns=[
        "Rank",
        "Grade",
        "Username",
        "Uploads",
        "Followers",
        "Following",
        "Likes",
        "Interactions",
        "URL username",
    ],
)
print(df)
df.to_csv("data.csv", index=False)

Highlighting Values in Pandas

Good Evening all,
I have made a pandas DataFrame from an Excel spreadsheet. I am trying to highlight the names of anyone who has logged in a minute or more past the hour or half hour (e.g. 9:01:00), but excluding those who logged in early (e.g. 07:59:00 or 07:29:00). These are the entries with * around the time below. I am a complete amateur coder, so I apologise. If things could be put in the simplest form, without assuming a great degree of knowledge, I would very much appreciate it. Also, if this is incredibly complex or impossible, I also apologise.
Name Login\nTime
0 ITo 07:59:09
1 Ann 07:59:13
2 Darryll 07:59:24
3 Darren 07:59:31
4 FlorR 07:59:42
5 Colm 07:59:56
6 NatashaBr 07:59:59
7 AlexRobe 07:59:59
8 JonathanSinn 08:00:02
9 BrendanJo 08:00:04
10 DanielCov 08:00:15
11 RW 08:00:17
12 SaraHerrma 08:00:26
13 RobertStew 08:00:37
14 JasonBal *08:04:36*
17 KevinAll 08:59:52
18 JFo 09:00:05
19 LiviaHarr 09:00:22
20 Patrick *09:01:36*
24 SianDi 09:30:32
25 AlisonBri 09:59:27
26 MMulholl 10:00:02
27 TiffanyThom 10:00:07
29 GeorgeEdw 11:00:00
30 JackSha 11:00:50
31 UsmanA 11:59:46
32 LewisBrad 12:02:30
34 RyanmacCor 12:59:20
35 GerardMcphil 12:59:56
36 TanjaN 13:00:07
37 MartinRichar 13:30:08
38 MarkBellin 13:30:20
39 KyranSpur 13:30:24
40 RichRam 13:58:53
41 OctavioSan 14:30:10
42 CharlesS 16:45:07
43 DanielHoll 16:50:55
44 ThomasHoll 16:59:45
45 RosieFl 16:59:56
46 CiaranMur 17:00:01
47 LouiseDa 17:29:29
48 WilliamAi 17:30:02
You can have a look at the Pandas styling options. It has an applymap function which helps you to color code specific columns based on conditions of your choice.
The documentation (https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html) has some examples; you can go through these and decide how you want to highlight the column values.
Assuming you have a DataFrame called df, you can define your own styling function func() and apply it to your DataFrame:
df.style.apply(func)
You can define your styling function based on the examples in the documentation. Let me know if you have any more questions.
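For instance, here is a minimal sketch of such a function. It assumes the DataFrame is called df, the column is literally named "Login\nTime" as in the printout above, and "late" means one to fourteen minutes past the hour or half hour; adjust the rule and the colour to your needs:

import pandas as pd

# Tiny stand-in for the real DataFrame; in practice df comes from read_excel().
df = pd.DataFrame({
    "Name": ["ITo", "JasonBal", "Patrick"],
    "Login\nTime": ["07:59:09", "08:04:36", "09:01:36"],
})

def highlight_late(row):
    # Highlight the whole row when the login is 1-14 minutes past the hour
    # or half hour (one possible reading of "late"; change the rule as needed).
    t = pd.to_datetime(row["Login\nTime"], format="%H:%M:%S")
    is_late = 1 <= (t.minute % 30) <= 14
    return ["background-color: yellow" if is_late else "" for _ in row]

styled = df.style.apply(highlight_late, axis=1)
styled.to_excel("highlighted.xlsx")  # needs openpyxl; in a notebook, just display `styled`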

Protect one specific case in regex in python

I need to replace german phone numbers in python, which is well-explained here:
Regexp for german phone number format
Possible formats are:
06442) 3933023
(02852) 5996-0
(042) 1818 87 9919
06442 / 3893023
06442 / 38 93 02 3
06442/3839023
042/ 88 17 890 0
+49 221 549144 – 79
+49 221 - 542194 79
+49 (221) - 542944 79
0 52 22 - 9 50 93 10
+49(0)121-79536 - 77
+49(0)2221-39938-113
+49 (0) 1739 906-44
+49 (173) 1799 806-44
0173173990644
0214154914479
02141 54 91 44 79
01517953677
+491517953677
015777953677
02162 - 54 91 44 79
(02162) 54 91 44 79
I am using the following code:
df['A'] = df['A'].replace(r'(\(?([\d \-\)\–\+\/\(]+)\)?([ .\-–\/]?)([\d]+))', 'TEL', regex=True)
The Problem is I have dates in the text:
df['A']
2017-03-07 13:48:39 Dear Sear Madam...
This is necessary to keep; how can I exclude the formats 2017-03-07 and 13:48:39 from my regex replacement?
Short Example:
df['A']
2017-03-077
2017-03-07
0211 11112244
desired output:
df['A']
TEL
2017-03-07
TEL
Any way you slice it, you are not dealing with regular data, and regular expressions work best with regular data. You are always going to run into "false positives" in your situation.
Your best bet is to write out each pattern individually as a giant OR. Below is the pattern for the first three phone number formats, so just do the rest of them.
\d{5}\) \d{7}|\(\d{5}\) \d{4}-\d|\(\d{3}\) \d{4} \d{2} \d{4}
https://regex101.com/r/6NPzup/1
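For example, here is a sketch using only those first three alternatives; the DataFrame values are made up to mirror the question:

import pandas as pd

# Only the first three formats from the list; extend the alternation with the rest.
phone_re = r"\d{5}\) \d{7}|\(\d{5}\) \d{4}-\d|\(\d{3}\) \d{4} \d{2} \d{4}"

df = pd.DataFrame({"A": ["06442) 3933023", "2017-03-07", "(042) 1818 87 9919"]})
df["A"] = df["A"].str.replace(phone_re, "TEL", regex=True)
print(df)
# Rows 0 and 2 become "TEL"; the date in row 1 is left untouched because
# none of the explicit phone-number alternatives can match it.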

Combining Rows in a DataFrame

I have a DF that has the results of a NER classifier such as the following:
df =
s token pred tokenID
17 hakawati B-Loc 3
17 theatre L-Loc 3
17 jerusalem U-Loc 7
56 university B-Org 5
56 of I-Org 5
56 texas I-Org 5
56 here L-Org 6
...
5402 dwight B-Peop 1
5402 d. I-Peop 1
5402 eisenhower L-Peop 1
There are many other columns in this DataFrame that are not relevant. Now I want to group the tokens depending on their sentenceID (=s) and their predicted tags to combine them into a single entity:
df2 =
s token pred
17 hakawati theatre Location
17 jerusalem Location
56 university of texas here Organisation
...
5402 dwight d. eisenhower People
Normally I would do so by simply using a line like
data_map = df.groupby(["s"], as_index=False, sort=False).agg(" ".join) and then using a rename function. However, since the data contains different kinds of strings (B/I/L/U prefixes combined with Loc/Org/Peop), I don't know exactly how to do it.
Any ideas are appreciated.
One solution is to use a helper column:
df['pred_cat'] = df['pred'].str.split('-').str[-1]
res = df.groupby(['s', 'pred_cat'])['token'].apply(' '.join).reset_index()
print(res)
s pred_cat token
0 17 Loc hakawati theatre jerusalem
1 56 Org university of texas here
2 5402 Peop dwight d. eisenhower
Note this doesn't match exactly your desired output; there seems to be some data-specific treatment involved.
You could group by both s and tokenID and aggregate like so:
import pandas as pd

def aggregate(df):
    token = " ".join(df.token)
    pred = df.iloc[0].pred.split("-", 1)[1]
    return pd.Series({"token": token, "pred": pred})

df.groupby(["s", "tokenID"]).apply(aggregate)
# Output
token pred
s tokenID
17 3 hakawati theatre Loc
7 jerusalem Loc
56 5 university of texas Org
6 here Org
5402 1 dwight d. eisenhower Peop
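If you also want the full labels from the desired output (Location, Organisation, People), you could map the abbreviations afterwards; the mapping below is an assumption inferred from the question's example:

# Continuing from the aggregate() snippet above; the label mapping is hypothetical.
label_map = {"Loc": "Location", "Org": "Organisation", "Peop": "People"}

result = df.groupby(["s", "tokenID"]).apply(aggregate).reset_index()
result["pred"] = result["pred"].map(label_map)
print(result)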

Remove rows where a column contains a specific substring [duplicate]

This question already has answers here:
Search for "does-not-contain" on a DataFrame in pandas
(9 answers)
Closed 2 years ago.
How do I eliminate rows that have a word I don't want?
I have this DataFrame:
index price description
0 15 Kit 10 Esponjas Para Cartuchos Jato De Tinta ...
1 15 Snap Fill Para Cartuchos Hp 60 61 122 901 21 ...
2 16 Clips Para Cartuchos Hp 21 22 60 74 75 92 93 ...
I'm trying to remove the rows with the word 'esponja'.
I want a DataFrame like this:
index price description
1 15 Snap Fill Para Cartuchos Hp 60 61 122 901 21 ...
2 16 Clips Para Cartuchos Hp 21 22 60 74 75 92 93 ...
I'm a newbie and I don't have any idea how to resolve this.
Create a boolean mask by checking for strings that contain 'Esponjas', then index into your dataframe with the negated mask.
df[~df['description'].str.contains('Esponjas')]
If you are unsure what's going on, print out what
df['description']
df['description'].str.contains('Esponjas')
~df['description'].str.contains('Esponjas')
do on their own. If you want to perform the substring-check case-insensitively, use case=False as a keyword argument to str.contains.
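For instance, a case-insensitive variant might look like this (na=False is an extra guard, in case the description column contains missing values):

df[~df['description'].str.contains('esponja', case=False, na=False)]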
