I'm tidying a string in Python and need to substitute some of the text (following a certain rule) using Regex. In the string (copied below), a place is usually mentioned followed by a comma and then the city's associated mortality rate. The next place is separated with a semi-colon. However there are some examples where the semi-colon is missing and I need to use Regex to add that semi-colon back in (e.g. 'Plymouth, 19 Portsmouth, 15' should be 'Plymouth, 19; Portsmouth, 15').
The text is as follows:
Birkenhead, 16; Birmingham, 15; Blackburn, 16; Bolton, 18 ; Bradford, 16 ; Brighton, 14 Bristol, 20; Cardiff, 25 ; Derby, 12 ; Halifax, 20; Biddersfield, 21 ; Hull, 19 ; Leeds, 22 ; Leicester, 18 ; London, 17; Manchester,15 ; Norwich, 24; Nottingham, 21; Oldham, 18 ; Plymouth, 19 Portsmouth, 15 ; Preston, 23 ; Salford, 14 ; Sheffield, 16 ; Sunderland, 18; Wolverhampton. 30. The rate in Edinburgh was 14 ;in Glasgow, 23 ; and in Dublin. 22.
I've tried using re.sub() for this using the following formula and using non-capture sets but am doing something very horribly wrong!
mystring = [the string here]
re.sub("(?:[0-9])?\s(?:[A-Z0-9]?)", ";", mystring)
Is anyone able to help me fix this?
Thank you!
You can use a raw string with the regex (?<=\w),\s+(?!\w) and the .sub() function you were already using to replace the commas with semicolons. Here is the code; tell me if you need any more help, I'll be glad to assist (:
import re
mystring = "Birkenhead, 16; Birmingham, 15; Blackburn, 16; Bolton, 18 ; Bradford, 16 ; Brighton, 14 Bristol, 20; Cardiff, 25 ; Derby, 12 ; Halifax, 20; Biddersfield, 21 ; Hull, 19 ; Leeds, 22 ; Leicester, 18 ; London, 17; Manchester,15 ; Norwich, 24; Nottingham, 21; Oldham, 18 ; Plymouth, 19 Portsmouth, 15 ; Preston, 23 ; Salford, 14 ; Sheffield, 16 ; Sunderland, 18; Wolverhampton. 30. The rate in Edinburgh was 14 ;in Glasgow, 23 ; and in Dublin. 22."
newstring = re.sub(r"(?<=\w),\s+(?!\w)", "; ", mystring)
print(newstring)
Output should be:
Birkenhead, 16; Birmingham, 15; Blackburn, 16; Bolton, 18; Bradford, 16; Brighton, 14; Bristol, 20; Cardiff, 25; Derby, 12; Halifax, 20; Biddersfield, 21; Hull, 19; Leeds, 22; Leicester, 18; London, 17; Manchester, 15; Norwich, 24; Nottingham, 21; Oldham, 18; Plymouth, 19; Portsmouth, 15; Preston, 23; Salford, 14; Sheffield, 16; Sunderland, 18; Wolverhampton. 30. The rate in Edinburgh was 14; in Glasgow, 23; and in Dublin. 22.
You might use:
\w,\s*\d+\b(?!\s*;)
Explanation
\w, Match a word char followed by a comma
\s*\d+\b Match optional whitespace chars followed by 1+ digits and a word boundary
(?!\s*;) Negative lookahead, assert not optional whitespace chars followed by ; to the right
In the replacement use the full match followed by a semicolon \g<0>;
See a regex101 demo and a Python demo.
Example
import re
pattern = r"\w,\s*\d+\b(?!\s*;)"
s = "Birkenhead, 16; Birmingham, 15; Blackburn, 16; Bolton, 18 ; Bradford, 16 ; Brighton, 14 Bristol, 20; Cardiff, 25 ; Derby, 12 ; Halifax, 20; Biddersfield, 21 ; Hull, 19 ; Leeds, 22 ; Leicester, 18 ; London, 17; Manchester,15 ; Norwich, 24; Nottingham, 21; Oldham, 18 ; Plymouth, 19 Portsmouth, 15 ; Preston, 23 ; Salford, 14 ; Sheffield, 16 ; Sunderland, 18; Wolverhampton. 30. The rate in Edinburgh was 14 ;in Glasgow, 23 ; and in Dublin. 22."
result = re.sub(pattern, r"\g<0>;", s)
if result:
    print(result)
Output
Birkenhead, 16; Birmingham, 15; Blackburn, 16; Bolton, 18 ; Bradford, 16 ; Brighton, 14; Bristol, 20; Cardiff, 25 ; Derby, 12 ; Halifax, 20; Biddersfield, 21 ; Hull, 19 ; Leeds, 22 ; Leicester, 18 ; London, 17; Manchester,15 ; Norwich, 24; Nottingham, 21; Oldham, 18 ; Plymouth, 19; Portsmouth, 15 ; Preston, 23 ; Salford, 14 ; Sheffield, 16 ; Sunderland, 18; Wolverhampton. 30. The rate in Edinburgh was 14 ;in Glasgow, 23 ; and in Dublin. 22.
Or without matching newlines and starting the match with a char a-zA-Z
[a-zA-Z],[^\S\n]*\d+\b(?![^\S\n]*;)
See another regex101 demo
Plymouth, Wolverhampton, Edinburgh, Glasgow and Dublin all seemed to have issues with the formatting not being as expected.
My assumption is that you want to extract the city name and the death rate from this data. The extra text around some of the cities makes that difficult.
I would use the numeric values to split the data then use the last word with a capital letter to assume the city. Store these two bits of information in a Python dictionary.
pattern1 = r"(.+?)(\d+)"
pattern2 = r".*([A-Z]\w+)"
all_info = {}
# split to have letter before numbers and numbers
for letters, numbers in re.findall(pattern1, mystring):
    # Find the city name in the letters
    location = re.match(pattern2, letters).group(1)
    # Build a dictionary of the information
    all_info[location] = int(numbers)
If you want the specific string you are looking for then you can create that from the Python dictionary.
new_str = '; '.join([f'{key}, {value}' for (key, value) in all_info.items()]) + ';'
My full test I used was:
import json
import re
mystring = (
"Birkenhead, 16; "
"Birmingham, 15; "
"Blackburn, 16; "
"Bolton, 18 ; "
"Bradford, 16 ; "
"Brighton, 14 "
"Bristol, 20; "
"Cardiff, 25 ;"
" Derby, 12 ;"
" Halifax, 20;"
" Biddersfield, 21 ;"
" Hull, 19 ;"
" Leeds, 22 ;"
" Leicester, 18 ;"
" London, 17;"
" Manchester,15 ;"
" Norwich, 24;"
" Nottingham, 21;"
" Oldham, 18 ;"
" Plymouth, 19"
" Portsmouth, 15 ;"
" Preston, 23 ;"
" Salford, 14 ;"
" Sheffield, 16 ;"
" Sunderland, 18;"
" Wolverhampton. 30."
" The rate in Edinburgh was 14 ;"
"in Glasgow, 23 ;"
" and in Dublin. 22."
)
pattern1 = r"(.+?)(\d+)"
pattern2 = r".*([A-Z]\w+)"
all_info = {}
# split to have letter before numbers and numbers
for letters, numbers in re.findall(pattern1, mystring):
    # Find the city name in the letters
    location = re.match(pattern2, letters).group(1)
    # Build a dictionary of the information
    all_info[location] = int(numbers)
print(f"as json data:\n{json.dumps(all_info, indent=4)}")
new_str = '; '.join([f'{key}, {value}' for (key, value) in all_info.items()]) + ';'
print(f"as new string:\n{new_str}")
Which gave the following output:
as json data:
{
    "Birkenhead": 16,
    "Birmingham": 15,
    "Blackburn": 16,
    "Bolton": 18,
    "Bradford": 16,
    "Brighton": 14,
    "Bristol": 20,
    "Cardiff": 25,
    "Derby": 12,
    "Halifax": 20,
    "Biddersfield": 21,
    "Hull": 19,
    "Leeds": 22,
    "Leicester": 18,
    "London": 17,
    "Manchester": 15,
    "Norwich": 24,
    "Nottingham": 21,
    "Oldham": 18,
    "Plymouth": 19,
    "Portsmouth": 15,
    "Preston": 23,
    "Salford": 14,
    "Sheffield": 16,
    "Sunderland": 18,
    "Wolverhampton": 30,
    "Edinburgh": 14,
    "Glasgow": 23,
    "Dublin": 22
}
as new string:
Birkenhead, 16; Birmingham, 15; Blackburn, 16; Bolton, 18; Bradford, 16; Brighton, 14; Bristol, 20; Cardiff, 25; Derby, 12; Halifax, 20; Biddersfield, 21; Hull, 19; Leeds, 22; Leicester, 18; London, 17; Manchester, 15; Norwich, 24; Nottingham, 21; Oldham, 18; Plymouth, 19; Portsmouth, 15; Preston, 23; Salford, 14; Sheffield, 16; Sunderland, 18; Wolverhampton, 30; Edinburgh, 14; Glasgow, 23; Dublin, 22;
Thanks all! In the end, with some experimenting with my SO, I have found a different solution:
re.sub(r"(?<=\d)[ ;]+", "; ", mystring)
Effectively, we just look for all runs of one or more spaces and/or semi-colons that are preceded by a digit (using a lookbehind) and then replace each match with "; ".
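As a minimal sketch of that substitution, here it is on a shortened sample. The sample string and the variable names are made up for illustration and are not the original data.

```python
import re

# Shortened, hypothetical sample with the same kinds of defects as the real text:
# a stray space before a semicolon and two missing semicolons.
sample = "Bolton, 18 ; Brighton, 14 Bristol, 20; Plymouth, 19 Portsmouth, 15 ;"

# (?<=\d) is a lookbehind: the run of spaces and/or semicolons must follow a digit.
# Each such run is collapsed into a single "; ".
fixed = re.sub(r"(?<=\d)[ ;]+", "; ", sample).rstrip()

print(fixed)
```

Note that this fires after any digit that is followed by a space, so it probably only suits text where a number always ends an entry; a number inside a running sentence would also gain a semicolon.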
I have the following string printed from a python dataframe:
ALIF SASETYO, NIK: 3171060201830005 NPWP: 246383541071000 TTL: Jakarta, 02 Januari 1983 ARIEF HERMAWAN, NIK: 1271121011700003 NPWP: 070970173112000 TTL: Bogor, 10 November 1970 ARLAN SEPTIA ANANDA RASAM, NIK: 3174051209620003 NPWP: 080878200013000 TTL: Jakarta, 12 September 1962 CHAIRAL TANJUNG, NIK: 3171011605660004 NPWP: 070141650093000 TTL: Jakarta, 16 Mei 1966 FUAD RIZAL, NIK: 3174010201780008 NPWP: 488337379015000 TTL: Jakarta, 02 Januari 1978 Ir. R AGUS HARYOTO PURNOMO, UTAMA RASRINIK: 3578032408610001 NPWP: 097468813615000 TTL: SLEMAN, 24 Agustus 1961 PT CTCORP INFRASTRUKTUR D INDONESIA, Nomor SK :- I JalanKaptenPierreTendeanKavling12-14A PT INTRERPORT PATIMBAN AGUNG, Nomor SK :- PT PATIMBAN MAJU BERSAMA, Nomor SK :AHU- 0061318.AH.01.01.TAHUN 2021 Tanggal SK :30 September 2021 PT TERMINAL PETIKEMAS SURABAYA, Nomor SK :- Nama YUKKI NUGRAHAWAN HANAFI, NIK: 3174060211670004 NPWP: 093240992016000 TTL: Jakarta, 02 November 1967
which I extracted through the following code:
import pandas as pd
import re
input_csv_file = "./CSV/Officers_and_Shareholders.csv"
df = pd.read_csv(input_csv_file, skiprows=10, on_bad_lines='skip')
df.fillna('', inplace=True)
df.columns = ['Nama', 'Jabatan', 'Alamat', 'Klasifikasi Saham', 'Jumlah Lembar Saham', 'Total']
pattern = re.compile(r'[A-Z]+\s[]+\s{}[A-Z]+[,]')
officers_df = df[(~df["Nama"].str.startswith("NIK:") & (df["Jabatan"] != "-"))]
officers_df = df[(~df["Nama"].str.startswith("NPWP:") & (df["Jabatan"] != "-"))]
officers_df = df[(~df["Nama"].str.startswith("TTL:") & (df["Jabatan"] != "-"))]
officers_df = df[(~df["Nama"].str.startswith("Nomor SK") & (df["Jabatan"] != "-"))]
officers_df = df[(~df["Nama"].str.startswith("Tanggal SK") & (df["Jabatan"] != "-"))]
officers_df.reset_index(drop=True, inplace=True)
officers_list = df["Nama"].tolist()
officers_string = ' '.join(officers_list)
matches = pattern.findall(officers_string)
print(matches)
I tried applying the regex shown in the code above, but it returns the following:
'ALIF SASETYO,', 'ARIEF HERMAWAN,', 'ARLAN SEPTIA ANANDA RASAM,', 'CHAIRAL TANJUNG,', 'FUAD RIZAL,', 'R AGUS HARYOTO PURNOMO,', 'PT CTCORP INFRASTRUKTUR D INDONESIA,', 'A PT INTRERPORT PATIMBAN AGUNG,', 'PT PATIMBAN MAJU BERSAMA,', 'PT TERMINAL PETIKEMAS SURABAYA,', 'YUKKI NUGRAHAWAN HANAFI,'
I don't want the regex to be returning A and PT, and want to exclude the string that has PT on it. Is there a way to do this through regex?
I think the syntax you're looking for is something like [^PT]. As described in another question, you can use it to require that a regex does not match specified characters. Note that [^PT] is a negated character class: it excludes the single characters P and T, not the sequence PT.
Can you fit that into your existing regex? BTW, I really like https://regex101.com/ for testing regexes before putting them to use.
Try this
pattern = re.compile(r'(?!PT\s)([A-Z]+\s[A-Z]+[,])')
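An alternative that may be easier to reason about is to match every upper-case name first and then drop the ones containing the word PT in plain Python. This is only a sketch: the sample string is a shortened, made-up version of the data, and the pattern assumes names consist solely of upper-case words separated by single whitespace characters.

```python
import re

# Shortened, hypothetical sample in the same shape as the real string.
sample = (
    "ALIF SASETYO, NIK: 1 PT PATIMBAN MAJU BERSAMA, Nomor SK :- "
    "Nama YUKKI NUGRAHAWAN HANAFI, NIK: 2"
)

# One or more upper-case words, separated by whitespace, ending in a comma.
name_pattern = re.compile(r"\b[A-Z]+(?:\s[A-Z]+)*,")

matches = name_pattern.findall(sample)

# Keep only the names that do not contain the standalone word "PT".
people = [m for m in matches if "PT" not in m.rstrip(",").split()]

print(people)
```

Filtering after the match avoids the problem where a lookahead rejects "PT" but the regex then simply starts matching at the next word of the company name.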
import re
input_text = "03:00 am hay 1 Entre la 1:30 y las 2:0, o 01:02 am minuto y salimos, 19:30 pm salimos!! es importante llegar alla antes 20 :30 am, ya que 21:00 pm cierran algunos negocios, sin embargo el cine esta abierto hasta 23:30 pm o 01 : 00 am, 1:00 am 1:00 pm, : p m, 1: pm 5: pm"
This is the regexp prototype to detect a pattern that encompasses the following substrings
civil_time_pattern = r'(\d{1,2})[\s|:]*(\d{0,2})\s*(am|pm)?'
civil_time_unit_list = re.findall(civil_time_pattern, input_text)
the substrings it must be able to detect in the original input string: ["03:00 am", "1:30", "2:0", "01:02 am", "19:30 pm", "20 :30 am", "21:00 pm", "23:30 pm", "01 : 00 am", "1:00 am", "1:00 pm", ": p m", "1: pm", "5: pm"]
This is the conversion process that each detected substring (hh:mm am or pm) in the input string must go through. One of the problems with this code is how to apply these replacements only where the previous regex matches.
#Block of code that should receive the substring, chunk it and try to correct it, then later replace the corrected version in the original string
if (If the pattern is met... ):
    try:
        hh = civil_time_unit_list[0][0]
        if (hh == ""): hh = "00"
    except IndexError: hh = "00"
    try:
        mm = civil_time_unit_list[0][1]
        if (mm == ""): mm = "00"
    except IndexError: mm = "00"
    try:
        am_pm = civil_time_unit_list[0][2]
        if (am_pm == ""):
            if (int(hh) >= 0 and int(hh) < 12): am_pm = "am"
            elif (int(hh) >= 12 and int(hh) < 24): am_pm = "pm"
        else:
            #If it says pm, the indication pm will be prioritized over the hour that is indicated
            #But if it says am the time will be prioritized over the indication of am
            if (am_pm == "am"):
                if (int(hh) >= 12 and int(hh) < 24): am_pm = "pm"
                else: pass
            elif (am_pm == "pm"):
                if (int(hh) >= 0 and int(hh) < 12): hh = str( int(hh) + 12 )
                else: pass
    except IndexError:
        if (int(hh) >= 0 and int(hh) < 12): am_pm = "am"
        elif (int(hh) >= 12 and int(hh) < 24): am_pm = "pm"
    #Add "0" in front, if the substring is not 2 characters long
    if (len(hh) < 2): hh = "0" + hh
    if (len(mm) < 2): mm = "0" + mm
    output = hh + ":" + mm + " " + am_pm
    output = output.strip()
One possible problem is that we do not know how many times the pattern will appear, so I do not know how many substrings will have to go through the correction and replacement process. The same replacement can also occur two or more times.
print(repr(input_text)) #You should now be able to print the original string but with all the replacements already done.
And this is the correct output that I need, as you can see the previous process has been applied on each of the patterns hh:mm am or pm
input_text = "03:00 am hay 1 Entre la 01:30 am y las 02:00 am, o 01:02 am minuto y salimos, 19:30 pm salimos!! es importante llegar alla antes 20:30 pm, ya que 21:00 pm cierran algunos negocios, sin embargo el cine esta abierto hasta 23:30 pm o 01:00 am, 01:00 am 13:00 pm, 00:00 am, 13:00 pm 05:00 pm"
IIUC, you want to replace every matched string with a converted version of that same string. You can do that with re.sub by passing it a function that receives the match object, handles the conversion using the matched groups, and returns the string to use as the replacement:
input_text = "03:00 am hay 1 Entre la 1:30 y las 2:0, o 01:02 am minuto y salimos, 19:30 pm salimos!! es importante llegar alla antes 20 :30 am, ya que 21:00 pm cierran algunos negocios, sin embargo el cine esta abierto hasta 23:30 pm o 01 : 00 am, 1:00 am 1:00 pm, : p m, 1: pm 5: pm"
import re
civil_time_pattern = re.compile(r"(\d{1,2})[\s|:]*(\d{0,2})\s*(am|pm)?")
def convert(match):
    hh = match.group(1) or "00"
    mm = match.group(2) or "00"
    am_pm = match.group(3)
    if not am_pm:
        if 0 <= int(hh) < 12:
            am_pm = "am"
        elif 12 <= int(hh) < 24:
            am_pm = "pm"
    # If it says pm, the indication pm will be prioritized over the hour that is indicated
    # But if it says am the time will be prioritized over the indication of am
    if am_pm == "am":
        if 12 <= int(hh) < 24:
            am_pm = "pm"
    elif am_pm == "pm":
        if 0 <= int(hh) < 12:
            hh = str(int(hh) + 12)
    # Add "0" in front, if the substring is not 2 characters long
    hh = hh.zfill(2)
    mm = mm.zfill(2)
    output = f"{hh}:{mm}"
    return output
result = civil_time_pattern.sub(convert, input_text)
print(result)
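The same callable-replacement idea can be sketched with a stricter pattern. This is an illustration under assumptions that are not in the original: it only treats a substring as a time when both the hour and the minutes are present around a colon, it uses word boundaries so stray numbers are left alone, and the names TIME_RE and normalize are hypothetical. It does not cover the incomplete cases such as "1: pm" or ": p m".

```python
import re

# Hour, optional spaces, colon, optional spaces, minutes, optional am/pm suffix.
TIME_RE = re.compile(r"\b(\d{1,2})\s*:\s*(\d{1,2})(?:\s*(am|pm))?\b")

def normalize(match):
    """Return the matched time as 'hh:mm am' or 'hh:mm pm'."""
    hh = int(match.group(1))
    mm = int(match.group(2))
    suffix = match.group(3)
    # An explicit "pm" wins over a small hour: 1:00 pm becomes hour 13.
    if suffix == "pm" and hh < 12:
        hh += 12
    # Otherwise the hour decides the suffix, so "20:30 am" becomes pm.
    suffix = "pm" if hh >= 12 else "am"
    return f"{hh:02d}:{mm:02d} {suffix}"

text = "Entre la 1:30 y las 2:0, o 20 :30 am, 1:00 pm"
result = TIME_RE.sub(normalize, text)
print(result)
```

Because re.sub calls the function once per match, it does not matter how many times the pattern appears or whether the same time occurs more than once.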
I have scraped data from Wikipedia and created a dataframe. df[0] contains
{{Infobox_President |name = Mohammed Anwar Al Sadat < br / > محمد أنورالسادات |nationality = Al Menofeia, Mesir |image = Anwar Sadat cropped.jpg |order = Presiden Mesir ke-3 |term_start = 20 Oktober 1970 |term_end = 6 Oktober 1981 |predecessor = Gamal Abdel Nasser |successor = Hosni Mubarak |birth_date =|birth_place = Mit Abu Al-Kum, Al-Minufiyah, Mesir |death_place = Kairo, Mesir |death_date =|spouse = Jehan Sadat |party = Persatuan Arab Sosialis < br / > (hingga 1977) < br / > Partai Nasional Demokratik (Mesir)|Partai Nasional Demokratik < br / > (dari 1977) |vicepresident =|constituency =}} Jenderal Besar Mohammed Anwar Al Sadat () adalah seorang tentara dan politikus Mesir. Ia menjabat sebagai Presiden Mesir|Presiden ketiga Mesir pada periode 15 Oktober 1970 hingga terbunuhnya pada 6 Oktober 1981. Oleh dunia Barat ia dianggap sebagai orang yang sangat berpengaruh di Mesir dan di Timur Tengah dalam sejarah modern.
I want to remove:
{{Infobox_President |name = Mohammed Anwar Al Sadat < br / > محمد أنورالسادات |nationality = Al Menofeia, Mesir |image = Anwar Sadat cropped.jpg |order = Presiden Mesir ke-3 |term_start = 20 Oktober 1970 |term_end = 6 Oktober 1981 |predecessor = Gamal Abdel Nasser |successor = Hosni Mubarak |birth_date =|birth_place = Mit Abu Al-Kum, Al-Minufiyah, Mesir |death_place = Kairo, Mesir |death_date =|spouse = Jehan Sadat |party = Persatuan Arab Sosialis < br / > (hingga 1977) < br / > Partai Nasional Demokratik (Mesir)|Partai Nasional Demokratik < br / > (dari 1977) |vicepresident =|constituency =}}
How can I do this? I have tried
df['Body'] = df['Body'].replace('< ref >.< \/ref > | {{.}} | {{.*=}}','', regex = True)
df['Body'] = df['Body'].str.replace('\'\'\' | \n | [ | ] | \'\'','',regex=True)
but it doesn't work
This should do the trick
import re
re.sub('^{{.*}}','', text)
you can apply this function to the column of your dataframe and it will transform the column.
You were very close. It did not work because of the extra spacing in your regex pattern: " | {{.*=}}" treats the spaces around the curly braces as part of the match. As suggested in the other answer, you can use the special operator ^, which anchors at the start of the line.
Otherwise, to apply a regex replace that matches that exact pattern, remove the whitespace from your pattern:
text = '{{Infobox_President |name = Mohammed Anwar Al Sadat < br / > محمد أنورالسادات |nationality = Al Menofeia, Mesir |image = Anwar Sadat cropped.jpg |order = Presiden Mesir ke-3 |term_start = 20 Oktober 1970 |term_end = 6 Oktober 1981 |predecessor = Gamal Abdel Nasser |successor = Hosni Mubarak |birth_date =|birth_place = Mit Abu Al-Kum, Al-Minufiyah, Mesir |death_place = Kairo, Mesir |death_date =|spouse = Jehan Sadat |party = Persatuan Arab Sosialis < br / > (hingga 1977) < br / > Partai Nasional Demokratik (Mesir)|Partai Nasional Demokratik < br / > (dari 1977) |vicepresident =|constituency =}} Jenderal Besar Mohammed Anwar Al Sadat () adalah seorang tentara dan politikus Mesir. Ia menjabat sebagai Presiden Mesir|Presiden ketiga Mesir pada periode 15 Oktober 1970 hingga terbunuhnya pada 6 Oktober 1981. Oleh dunia Barat ia dianggap sebagai orang yang sangat berpengaruh di Mesir dan di Timur Tengah dalam sejarah modern.'
import pandas as pd
df = pd.DataFrame({'text':[text]})
new_df = df.replace(r'< ref >.< \/ref >|{{.*}}','', regex = True)
new_df.text[0]
Output:
' Jenderal Besar Mohammed Anwar Al Sadat () adalah seorang tentara dan politikus Mesir. Ia menjabat sebagai Presiden Mesir|Presiden ketiga Mesir pada periode 15 Oktober 1970 hingga terbunuhnya pada 6 Oktober 1981. Oleh dunia Barat ia dianggap sebagai orang yang sangat berpengaruh di Mesir dan di Timur Tengah dalam sejarah modern.'
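One thing to watch with {{.*}} is that .* is greedy, so it runs to the last }} in the string. The sketch below uses a short made-up string (not the scraped article) to contrast it with the lazy form .*?; it assumes the infobox itself contains no nested {{...}} templates, which a plain regex would not handle.

```python
import re

# Hypothetical text: an infobox followed by prose that contains another template.
text = "{{Infobox |name = A |party = B}} Intro text {{cite}} more."

# Greedy: .* extends to the LAST "}}", swallowing the prose in between.
greedy = re.sub(r"^\{\{.*\}\}", "", text)

# Lazy: .*? stops at the FIRST "}}"; \s* also drops the space that follows.
lazy = re.sub(r"^\{\{.*?\}\}\s*", "", text)

print(repr(greedy))
print(repr(lazy))
```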
How can I get a dictionary like below from this text file:
[Fri Aug 20]
shamooshak 4-0 milan
Tehran 2-0 Ams
Liverpool 0-2 Mes
[Fri Aug 19]
Esteghlal 1-0 perspolise
Paris 2-0 perspolise
[Fri Aug 20]
RahAhan 0-0 milan
[Wed Agu 11]
Munich 3-3 ABC
[Wed Agu 12]
RM 0-0 Tarakto
[Sat Jau 01]
Bayern 2-0 Manchester
I have tried list comprehension, for loops with enumerate function. But I could not build this list.
My desired dictionary is:
{'[Fri Aug 20]':[shamooshak 4-0 milan, Tehran 2-0 Ams,Liverpool 0-2 Mes],'[Fri Aug 19]':[Esteghlal 1-0 perspolise,Paris 2-0 perspolise]... and so on.
Assuming your data is lines of text...
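def process_arbitrary_text(text):
    obj = {}
    arr = []
    k = None
    for line in text:
        line = line.strip()  # drop the trailing newline that file lines carry
        if line.startswith('[') and line.endswith(']'):
            if k and arr: # omit empty keys?
                obj.setdefault(k, []).extend(arr)  # merge repeated dates
            k = line
            arr = []
        elif line:
            arr.append(line)
    if k and arr:  # store the final group, which no later header flushes
        obj.setdefault(k, []).extend(arr)
    return obj
desired_dict = process_arbitrary_text(text)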
Edit: since you edited to say it was a text file, just include the following pattern
with open('filename.txt', 'r') as file:
    text = file.readlines()  # or loop directly: for line in file: ...
A for loop can be a savior here
a='''[Fri Aug 20]
shamooshak 4-0 milan
Tehran 2-0 Ams
Liverpool 0-2 Mes
[Fri Aug 19]
Esteghlal 1-0 perspolise
Paris 2-0 perspolise
[Fri Aug 20]
RahAhan 0-0 milan
[Wed Agu 11]
Munich 3-3 ABC
[Wed Agu 12]
RM 0-0 Tarakto
[Sat Jau 01]
Bayern 2-0 Manchester'''
d={}
temp_value=[]
temp_key=''
for i in a.split('\n'):
    if i.startswith('['):
        if temp_key and temp_key in d:
            d[temp_key]=d[temp_key]+temp_value
        elif temp_key:
            d[temp_key]=temp_value
        temp_key=i
        temp_value=[]
    else:
        temp_value.append(i)
# store the last group, which no later header flushes
if temp_key:
    d[temp_key]=d.get(temp_key, [])+temp_value
print(d)
Output
{'[Fri Aug 20]': ['shamooshak 4-0 milan', 'Tehran 2-0 Ams', 'Liverpool 0-2 Mes', 'RahAhan 0-0 milan'], '[Fri Aug 19]': ['Esteghlal 1-0 perspolise', 'Paris 2-0 perspolise'], '[Wed Agu 11]': ['Munich 3-3 ABC'], '[Wed Agu 12]': ['RM 0-0 Tarakto'], '[Sat Jau 01]': ['Bayern 2-0 Manchester']}
Using regular expressions (re module) and your sample text:
text = '''[Fri Aug 20]
shamooshak 4-0 milan
Tehran 2-0 Ams
Liverpool 0-2 Mes
[Fri Aug 19]
Esteghlal 1-0 perspolise
Paris 2-0 perspolise
[Fri Aug 20]
RahAhan 0-0 milan
[Wed Agu 11]
Munich 3-3 ABC
[Wed Agu 12]
RM 0-0 Tarakto
[Sat Jau 01]
Bayern 2-0 Manchester'''
import re
x = re.findall(r'\[.+?\][^\[]*', text)
x = [i.split('\n') for i in x]
d = dict()
for i in x:
    d[i[0]] = [j for j in i[1:] if j!='']
It gives following dictionary d:
`{'[Fri Aug 20]': ['RahAhan 0-0 milan'], '[Sat Jau 01]': ['Bayern 2-0 Manchester'], '[Fri Aug 19]': ['Esteghlal 1-0 perspolise', 'Paris 2-0 perspolise'], '[Wed Agu 12]': ['RM 0-0 Tarakto'], '[Wed Agu 11]': ['Munich 3-3 ABC']}`
I overlooked that dates might repeat, as pointed out by mad_. To avoid losing data, replace the for loop with
for i in x:
    d[i[0]] = []
for i in x:
    d[i[0]] = d[i[0]]+[j for j in i[1:] if j!='']
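For comparison, here is a small sketch of the same grouping with collections.defaultdict, which merges repeated dates without a separate initialisation loop. The function name group_by_header and the shortened sample are my own, and the sketch assumes every header line starts with "[" and ends with "]".

```python
from collections import defaultdict

def group_by_header(lines):
    """Group each non-header line under the most recent '[...]' header."""
    groups = defaultdict(list)
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith("[") and line.endswith("]"):
            current = line
            groups[current]  # register the header even if no lines follow it
        elif current is not None:
            groups[current].append(line)
    return dict(groups)

# Shortened sample with a repeated date.
sample = """[Fri Aug 20]
shamooshak 4-0 milan
[Fri Aug 19]
Paris 2-0 perspolise
[Fri Aug 20]
RahAhan 0-0 milan"""

result = group_by_header(sample.splitlines())
print(result)
```

The same function should also accept an open file object, since it strips the trailing newline from every line.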
I'm trying to create a dataframe from a .data file that's not greatly formatted.
Here's the raw text data:
FICHE CLIMATOLOGIQUE;
;
Statistiques 1981-2010 et records;
PARIS-MONTSOURIS (75) Indicatif : 75114001, alt : 75m, lat : 48°49'18"N, lon : 02°20'12"E;
Edité le : 18/12/2017 dans l'état de la base;
; Janv.; Févr.; Mars; Avril; Mai; Juin; Juil.; Août; Sept.; Oct.; Nov.; Déc.; Année;
La température la plus élevée (°C);
(Records établis sur la période du 01-06-1872 au 03-12-2017);
; 16.1; 21.4; 25.7; 30.2; 34.8; 37.6; 40.4; 39.5; 36.2; 28.9; 21.6; 17.1; 40.4;
Date ; 05-1999; 28-1960; 25-1955; 18-1949; 29-1944; 26-1947; 28-1947; 11-2003; 07-1895; 01-2011; 07-2015; 16-1989; 1947;
Température maximale (Moyenne en °C);
; 7.2; 8.3; 12.2; 15.6; 19.6; 22.7; 25.2; 25; 21.1; 16.3; 10.8; 7.5; 16;
Température moyenne (Moyenne en °C);
; 4.9; 5.6; 8.8; 11.5; 15.2; 18.3; 20.5; 20.3; 16.9; 13; 8.3; 5.5; 12.4;
Température minimale (Moyenne en °C);
; 2.7; 2.8; 5.3; 7.3; 10.9; 13.8; 15.8; 15.7; 12.7; 9.6; 5.8; 3.4; 8.9;
My first attempt didn't consider delimiters other than ';'. I used pd.read_table() :
df = pd.read_table("./file.data", sep=';', index_col=0, skiprows=7, header=0, skip_blank_lines=True, skipinitialspace=True)
This is the result I got :
As you can see, nearly all indexes are shifted, creating empty rows, and putting 'NaN' as index for the rows actually containing the data I want.
I figured this is due to some delimiters looking like this : ; ;.
So I tried giving the sep parameter a regex that matches both cases, ensuring the use of the python engine:
df = pd.read_table("./file.data", sep=';(\s+;)?', index_col=0, skiprows=7, header=0, skip_blank_lines=True, skipinitialspace=True, engine='python')
But the result is unsatisfying, as you can see below. (I took only a part of the dataframe, but the idea stays the same).
I've tried other slightly different regexes with similar result.
So I would basically like the labels that currently index the empty rows to be shifted one row down. I didn't try modifying the files directly, for efficiency reasons: I have around a thousand similar files to load into dataframes. For the same reason, I can't just rename the index, as some files won't have the same number of rows.
Is there a way to do this using pandas ? Thanks a lot.
You could do your manipulations after import:
from io import StringIO
import numpy as np
import pandas as pd
datafile = StringIO("""FICHE CLIMATOLOGIQUE;
;
Statistiques 1981-2010 et records;
PARIS-MONTSOURIS (75) Indicatif : 75114001, alt : 75m, lat : 48°49'18"N, lon : 02°20'12"E;
Edité le : 18/12/2017 dans l'état de la base;
; Janv.; Févr.; Mars; Avril; Mai; Juin; Juil.; Août; Sept.; Oct.; Nov.; Déc.; Année;
La température la plus élevée (°C);
(Records établis sur la période du 01-06-1872 au 03-12-2017);
; 16.1; 21.4; 25.7; 30.2; 34.8; 37.6; 40.4; 39.5; 36.2; 28.9; 21.6; 17.1; 40.4;
Date ; 05-1999; 28-1960; 25-1955; 18-1949; 29-1944; 26-1947; 28-1947; 11-2003; 07-1895; 01-2011; 07-2015; 16-1989; 1947;
Température maximale (Moyenne en °C);
; 7.2; 8.3; 12.2; 15.6; 19.6; 22.7; 25.2; 25; 21.1; 16.3; 10.8; 7.5; 16;
Température moyenne (Moyenne en °C);
; 4.9; 5.6; 8.8; 11.5; 15.2; 18.3; 20.5; 20.3; 16.9; 13; 8.3; 5.5; 12.4;
Température minimale (Moyenne en °C);
; 2.7; 2.8; 5.3; 7.3; 10.9; 13.8; 15.8; 15.7; 12.7; 9.6; 5.8; 3.4; 8.9;""")
df = pd.read_table(datafile, sep=';', index_col=0, skiprows=7, header=0, skip_blank_lines=True, skipinitialspace=True)
df1 = pd.DataFrame(df.values[~df.isnull().all(axis=1),:], index=df.index.dropna()[np.r_[0,2:6]], columns=df.columns)
df_out = df1.dropna(how='all',axis=1)
print(df_out)
Output:
                                        Janv.    Févr.     Mars    Avril  \
La température la plus élevée (°C)       16.1     21.4     25.7     30.2
Date                                  05-1999  28-1960  25-1955  18-1949
Température maximale (Moyenne en °C)      7.2      8.3     12.2     15.6
Température moyenne (Moyenne en °C)       4.9      5.6      8.8     11.5
Température minimale (Moyenne en °C)      2.7      2.8      5.3      7.3
                                          Mai     Juin    Juil.     Août  \
La température la plus élevée (°C)       34.8     37.6     40.4     39.5
Date                                  29-1944  26-1947  28-1947  11-2003
Température maximale (Moyenne en °C)     19.6     22.7     25.2       25
Température moyenne (Moyenne en °C)      15.2     18.3     20.5     20.3
Température minimale (Moyenne en °C)     10.9     13.8     15.8     15.7
                                        Sept.     Oct.     Nov.     Déc.    Année
La température la plus élevée (°C)       36.2     28.9     21.6     17.1     40.4
Date                                  07-1895  01-2011  07-2015  16-1989     1947
Température maximale (Moyenne en °C)     21.1     16.3     10.8      7.5       16
Température moyenne (Moyenne en °C)      16.9       13      8.3      5.5     12.4
Température minimale (Moyenne en °C)     12.7      9.6      5.8      3.4      8.9
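Because the row positions picked with np.r_ depend on this particular file, another option for files with varying numbers of rows is to pair each label line with the data line that follows it before building the dataframe. The sketch below uses only the standard library; the function name parse_climate_table, the header_marker parameter and the shortened sample are assumptions for illustration, and it assumes the column header row is the first one containing "Janv.".

```python
from io import StringIO

def parse_climate_table(handle, header_marker="Janv."):
    """Pair each label-only line with the next data line (hypothetical helper)."""
    columns = None
    rows = {}
    pending_label = None
    for line in handle:
        fields = [f.strip() for f in line.rstrip("\n").split(";")]
        if fields and fields[-1] == "":
            fields.pop()  # the trailing ';' leaves an empty last field
        if not fields:
            continue
        label, values = fields[0], fields[1:]
        if columns is None:
            # Skip the preamble until the month header row is found.
            if header_marker in values:
                columns = values
            continue
        if values:
            # A data row: use its own label, or the label waiting from above.
            key = label or pending_label
            if key is not None:
                rows[key] = dict(zip(columns, values))
            pending_label = None
        elif label and pending_label is None:
            # A label-only row; later label-only rows (e.g. notes) are ignored.
            pending_label = label
    return columns, rows

sample = StringIO(
    "FICHE CLIMATOLOGIQUE;\n"
    ";\n"
    "; Janv.; Févr.; Année;\n"
    "La température la plus élevée (°C);\n"
    "(Records établis sur la période du 01-06-1872 au 03-12-2017);\n"
    "; 16.1; 21.4; 40.4;\n"
    "Date ; 05-1999; 28-1960; 1947;\n"
    "Température maximale (Moyenne en °C);\n"
    "; 7.2; 8.3; 16;\n"
)
columns, rows = parse_climate_table(sample)
print(columns)
print(list(rows))
```

The resulting dict of dicts can then be handed to pandas, for example with pd.DataFrame.from_dict(rows, orient="index"), though the values stay strings until they are converted.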