I'm trying to create a dataframe from a .data file that's not greatly formatted.
Here's the raw text data:
FICHE CLIMATOLOGIQUE;
;
Statistiques 1981-2010 et records;
PARIS-MONTSOURIS (75) Indicatif : 75114001, alt : 75m, lat : 48°49'18"N, lon : 02°20'12"E;
Edité le : 18/12/2017 dans l'état de la base;
; Janv.; Févr.; Mars; Avril; Mai; Juin; Juil.; Août; Sept.; Oct.; Nov.; Déc.; Année;
La température la plus élevée (°C);
(Records établis sur la période du 01-06-1872 au 03-12-2017);
; 16.1; 21.4; 25.7; 30.2; 34.8; 37.6; 40.4; 39.5; 36.2; 28.9; 21.6; 17.1; 40.4;
Date ; 05-1999; 28-1960; 25-1955; 18-1949; 29-1944; 26-1947; 28-1947; 11-2003; 07-1895; 01-2011; 07-2015; 16-1989; 1947;
Température maximale (Moyenne en °C);
; 7.2; 8.3; 12.2; 15.6; 19.6; 22.7; 25.2; 25; 21.1; 16.3; 10.8; 7.5; 16;
Température moyenne (Moyenne en °C);
; 4.9; 5.6; 8.8; 11.5; 15.2; 18.3; 20.5; 20.3; 16.9; 13; 8.3; 5.5; 12.4;
Température minimale (Moyenne en °C);
; 2.7; 2.8; 5.3; 7.3; 10.9; 13.8; 15.8; 15.7; 12.7; 9.6; 5.8; 3.4; 8.9;
My first attempt didn't consider delimiters other than ';'. I used pd.read_table():
df = pd.read_table("./file.data", sep=';', index_col=0, skiprows=7, header=0, skip_blank_lines=True, skipinitialspace=True)
This is the result I got :
As you can see, nearly all indexes are shifted, creating empty rows and putting NaN as the index for the rows that actually contain the data I want.
I figured this is due to some delimiters looking like this: "; ;" (a semicolon, whitespace, then another semicolon).
So I tried giving the sep parameter a regex that matches both cases, ensuring the use of the python engine:
df = pd.read_table("./file.data", sep=r';(\s+;)?', index_col=0, skiprows=7, header=0, skip_blank_lines=True, skipinitialspace=True, engine='python')
But the result is unsatisfying, as you can see below. (I took only a part of the dataframe, but the idea stays the same).
I've tried other slightly different regexes with similar result.
So I would basically like the labels indexing the empty rows to be shifted one row down. I didn't try modifying the files directly, for efficiency reasons: I have around a thousand similar files to load into dataframes. For the same reason, I can't just rename the index, as some files won't have the same number of rows and such.
Is there a way to do this using pandas? Thanks a lot.
You could do your manipulations after import:
from io import StringIO
import numpy as np
import pandas as pd
datafile = StringIO("""FICHE CLIMATOLOGIQUE;
;
Statistiques 1981-2010 et records;
PARIS-MONTSOURIS (75) Indicatif : 75114001, alt : 75m, lat : 48°49'18"N, lon : 02°20'12"E;
Edité le : 18/12/2017 dans l'état de la base;
; Janv.; Févr.; Mars; Avril; Mai; Juin; Juil.; Août; Sept.; Oct.; Nov.; Déc.; Année;
La température la plus élevée (°C);
(Records établis sur la période du 01-06-1872 au 03-12-2017);
; 16.1; 21.4; 25.7; 30.2; 34.8; 37.6; 40.4; 39.5; 36.2; 28.9; 21.6; 17.1; 40.4;
Date ; 05-1999; 28-1960; 25-1955; 18-1949; 29-1944; 26-1947; 28-1947; 11-2003; 07-1895; 01-2011; 07-2015; 16-1989; 1947;
Température maximale (Moyenne en °C);
; 7.2; 8.3; 12.2; 15.6; 19.6; 22.7; 25.2; 25; 21.1; 16.3; 10.8; 7.5; 16;
Température moyenne (Moyenne en °C);
; 4.9; 5.6; 8.8; 11.5; 15.2; 18.3; 20.5; 20.3; 16.9; 13; 8.3; 5.5; 12.4;
Température minimale (Moyenne en °C);
; 2.7; 2.8; 5.3; 7.3; 10.9; 13.8; 15.8; 15.7; 12.7; 9.6; 5.8; 3.4; 8.9;""")
df = pd.read_table(datafile, sep=';', index_col=0, skiprows=7, header=0, skip_blank_lines=True, skipinitialspace=True)
df1 = pd.DataFrame(df.values[~df.isnull().all(axis=1),:], index=df.index.dropna()[np.r_[0,2:6]], columns=df.columns)  # np.r_[0,2:6] skips the "(Records ...)" label at position 1
df_out = df1.dropna(how='all',axis=1)
print(df_out)
Output:
Janv. Févr. Mars Avril \
La température la plus élevée (°C) 16.1 21.4 25.7 30.2
Date 05-1999 28-1960 25-1955 18-1949
Température maximale (Moyenne en °C) 7.2 8.3 12.2 15.6
Température moyenne (Moyenne en °C) 4.9 5.6 8.8 11.5
Température minimale (Moyenne en °C) 2.7 2.8 5.3 7.3
Mai Juin Juil. Août \
La température la plus élevée (°C) 34.8 37.6 40.4 39.5
Date 29-1944 26-1947 28-1947 11-2003
Température maximale (Moyenne en °C) 19.6 22.7 25.2 25
Température moyenne (Moyenne en °C) 15.2 18.3 20.5 20.3
Température minimale (Moyenne en °C) 10.9 13.8 15.8 15.7
Sept. Oct. Nov. Déc. Année
La température la plus élevée (°C) 36.2 28.9 21.6 17.1 40.4
Date 07-1895 01-2011 07-2015 16-1989 1947
Température maximale (Moyenne en °C) 21.1 16.3 10.8 7.5 16
Température moyenne (Moyenne en °C) 16.9 13 8.3 5.5 12.4
Température minimale (Moyenne en °C) 12.7 9.6 5.8 3.4 8.9
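With around a thousand similar files, hardcoding np.r_[0,2:6] is fragile. A more generic sketch, assuming every file parses the same way (label lines become all-NaN rows, data lines get a NaN index, and parenthesised note lines like "(Records établis ...)" can be discarded), pushes each label down onto the data rows beneath it:

```python
import numpy as np
import pandas as pd

def relabel(df):
    # labels come from the parsed index; the actual data rows have a NaN label
    labels = df.index.to_series().reset_index(drop=True).str.strip()
    body = df.reset_index(drop=True)
    # discard parenthesised note lines, then carry each label down onto the rows below it
    labels = labels.mask(labels.str.startswith("(", na=False)).ffill()
    keep = body.notna().any(axis=1)  # rows that actually hold data
    out = body[keep].dropna(how="all", axis=1)
    out.index = labels[keep]
    return out
```

On the dataframe from the read above this should yield the same five labelled rows without any positional indexing, so it can be applied to files with a different number of blocks.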
import re, datetime, time
input_text = "tras la aparicion del objeto misterioso el 2022-12-30 visitamos ese sitio nuevamente revisando detras de los arboles pero recien tras 3 dias ese objeto aparecio de nuevo tras 2 arboles" #example 1
input_text = "luego el 2022-11-15 fuimos nuevamente a dicho lugar pero nada ocurrio y 3 dias despues ese objeto aparecio en el cielo durante el atardecer de aquel dia de verano" #example 2
input_text = "Me entere del misterioso suceso ese 2022-11-01 y el 2022-11-15 fuimos al monte pero nada ocurrio, pero luego de 13 dias ese objeto aparecio en el cielo" #example 3
identified_referencial_date = r"(?P<year>\d*)-(?P<month>\d{2})-(?P<startDay>\d{2})" #obtained with regex capture groups
# r"(?:luego[\s|]*de[\s|]*unos|luego[\s|]*de|pasados[\s|]*ya[\s|]*unos|pasados[\s|]*unos|pasados[\s|]*ya|pasados|tras[\s|]*ya|tras)[\s|]*\d*[\s|]*(?:días|dias|día|dia)"
# r"\d*[\s|]*(?:días|dias|día|dia)[\s|]*(?:despues|luego)"
n = ...  # the number of days by which the date should increase (not yet extracted)
indicated_date_relative_to_another = str(identified_date_in_date_format - datetime.timedelta(days=int(n)))  # pseudocode: identified_date_in_date_format is not defined yet
input_text = re.sub(identified_referencial_date, indicated_date_relative_to_another, input_text)
print(repr(input_text)) # --> output
The objective is that if a date appears first in the format year-month-day (integers separated by hyphens in that order, i.e. \d*-\d{2}-\d{2}) and the text then says that n days have passed, that phrase should be replaced with year-month-day + n.
luego de unos 3 dias ---> add 3 days to a previous date
luego de 6 dias ---> add 6 days to a previous date
pasados ya 13 dias ---> add 13 days to a previous date
pasados ya unos 48 dias ---> add 48 days to a previous date
pasados unos 36 dias ---> add 36 days to a previous date
pasados 9 dias ---> add 9 days to a previous date
tras ya 2 dias ---> add 2 days to a previous date
tras 32 dias ---> add 32 days to a previous date
3 dias despues ---> add 3 days to a previous date
3 dias luego ---> add 3 days to a previous date
Keep in mind that in certain cases increasing the number of days can also change the month or even the year, as in example 1.
The outputs I need to obtain in each case:
"tras la aparicion del objeto misterioso el 2022-12-30 visitamos ese sitio nuevamente revisando detras de los arboles pero recien 2023-01-02 ese objeto aparecio de nuevo tras 2 arboles" #for the example 1
"luego el 2022-11-15 fuimos nuevamente a dicho lugar pero nada ocurrio y 2022-11-18 ese objeto aparecio en el cielo durante el atardecer de aquel dia de verano" #for the example 2
"Me entere del misterioso suceso ese 2022-11-01 y el 2022-11-15 fuimos al monte pero nada ocurrio, pero 2022-11-28 ese objeto aparecio en el cielo" #for the example 3
Here is a regex solution you could use:
([12]\d{3}-[01]\d-[0-3]\d)(\D*?)(?:(?:luego de|pasados|tras)(?: ya)?(?: unos)? (\d+) dias|(\d+) dias (?:despues|luego))
This regex requires that there are no other digits between the date and the day-count phrase; since \D*? cannot cross a digit, each phrase binds to the nearest preceding date (which is why example 3 resolves against 2022-11-15 rather than 2022-11-01). It is also a bit loose on grammar: it would match "luego de ya 3 dias" as well. You can of course make it more precise with a longer regex, but you get the picture.
In a program:
from datetime import datetime, timedelta
import re

def add(datestr, days):
    return (datetime.strptime(datestr, "%Y-%m-%d")
            + timedelta(days=int(days))).strftime('%Y-%m-%d')

input_texts = [
    "tras la aparicion del objeto misterioso el 2022-12-30 visitamos ese sitio nuevamente revisando detras de los arboles pero recien tras 3 dias ese objeto aparecio de nuevo tras 2 arboles",
    "luego el 2022-11-15 fuimos nuevamente a dicho lugar pero nada ocurrio y 3 dias despues ese objeto aparecio en el cielo durante el atardecer de aquel dia de verano",
    "Me entere del misterioso suceso ese 2022-11-01 y el 2022-11-15 fuimos al monte pero nada ocurrio, pero luego de 13 dias ese objeto aparecio en el cielo"
]

for input_text in input_texts:
    result = re.sub(r"([12]\d{3}-[01]\d-[0-3]\d)(\D*?)(?:(?:luego de|pasados|tras)(?: ya)?(?: unos)? (\d+) dias|(\d+) dias (?:despues|luego))",
                    lambda m: m[1] + m[2] + add(m[1], m[3] or m[4]),
                    input_text)
    print(result)
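The month/year rollover the question worries about needs no special handling: timedelta arithmetic carries across month and year boundaries on its own. A quick sanity check of the add helper above, using the dates from examples 1 and 3:

```python
from datetime import datetime, timedelta

def add(datestr, days):
    # same helper as above: parse, shift by n days, format back
    return (datetime.strptime(datestr, "%Y-%m-%d")
            + timedelta(days=int(days))).strftime("%Y-%m-%d")

print(add("2022-12-30", "3"))   # crosses into the next year -> 2023-01-02
print(add("2022-11-15", "13"))  # stays within the month -> 2022-11-28
```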
When using BeautifulSoup to scrape a table from https://egov.uscis.gov/processing-times/historic-pt, instead of getting the values that can be seen in the content of the table, I get what seems to be a call to some sort of database:
table = webpage.select("table.records")
table
df = pd.read_html(str(table), na_values=0)[0]
df
Form Form Description Classification or Basis for Filing FY 2017 FY 2018 FY 2019 FY 2020 FY 2021 FY 2022
0 ${data.FORM_NAME} ${data.FORM_TITLE_EN} ${data.FORM_DESC_EN} ${data.FY14} ${data.FY15} ${data.FY16} ${data.FY17} ${data.NAT_AVG_MONTHS} ${data.FY22}
When I inspect the table with F12 I can see several <tr> elements with the content that I wish to scrape; however, when I look at the source code, what I see is what I suspect is a call to a database:
<tbody class="recordsBody">
<tr v-for="data in histFormsData">
<th scope="row" style="border-right:1px solid black; font-weight:bold">${data.FORM_NAME}</th>
<th scope="row">${data.FORM_TITLE_EN}</th>
<th scope="row" style="border-right:1px solid black">${data.FORM_DESC_EN}</th>
<td style="border-right:1px solid black; text-align:center">${data.FY14}</td>
<td style="border-right:1px solid black; border-right:1px solid black;text-align:center">${data.FY15}</td>
<td style="border-right:1px solid black; text-align:center">${data.FY16}</td>
<td style="border-right:1px solid black; text-align:center">${data.FY17}</td>
<td style="border-right:1px solid black; text-align:center">${data.NAT_AVG_MONTHS}</td>
<td style="text-align:center">${data.FY22}</td>
</tr>
</tbody>
This is the code I used to request the webpage:
#Load webpage content
r = requests.get("https://egov.uscis.gov/processing-times/historic-pt")
#Convert to beautiful soup object
webpage = bs(r.content)
print(webpage.prettify())
What can I do to get the row content that can be seen on the page? I am new to web scraping and was not able to find an answer to my question online.
Thanks in advance.
I tried importing the required packages, requesting the webpage, and using pandas to get the table:
#Import important packages
import requests # this one is for accessing webpages
from bs4 import BeautifulSoup as bs #scraping tool
import pandas as pd #pandas
#Load webpage content
r = requests.get("https://egov.uscis.gov/processing-times/historic-pt")
#Convert to beautiful soup object
webpage = bs(r.content)
print(webpage.prettify())
#Scraping table with pandas
table = webpage.select("table.records")
table
df = pd.read_html(str(table), na_values=0)[0]
df
The data is loaded from a different URL by client-side JavaScript (the v-for in the markup is a Vue template, rendered in the browser), so BeautifulSoup never sees the filled-in rows. Try:
import requests
import pandas as pd
url = "https://egov.uscis.gov/processing-times/historical-forms-data"
df = pd.DataFrame(requests.get(url, verify=False).json())
print(df)
Prints:
FORM_NUMBER FORM_NAME FORM_NAME_ES FORM_TITLE_EN FORM_TITLE_ES FORM_DESC_EN FORM_DESC_ES FY14 FY15 FY16 FY17 NAT_AVG_MONTHS FY22
0 I90 I-90 I-90 Application to Replace Permanent Resident Card Solicitud para Reemplazar Tarjeta de Residente Permanente Initial issuance, replacement or renewal Emisión inicial, reemplazo o renovación 6.8 8.0 7.8 8.3 5.2 1.2
1 I102 I-102 I-102 Application for Replacement/Initial Nonimmigrant Arrival/Departure Record Solicitud para Reemplazar o Registro Inicial de Entrada / Salida de No Inmigrante Initial issuance or replacement of a Form I-94 Emisión inicial o reemplazo de un Formulario I-94 4.9 3.9 3.3 3.9 4.0 7.8
2 I129 I-129 I-129 Petition for a Nonimmigrant Worker Petición de Trabajador No Inmigrante Nonimmigrant Petition (Premium filed) Petición de No Inmigrante (con Procesamiento Prioritario) 0.4 0.4 0.4 0.4 0.3 0.3
3 I129 I-129 I-129 Petition for a Nonimmigrant Worker Petición de Trabajador No Inmigrante Nonimmigrant Petition (non Premium filed) Petición de No Inmigrante (sin Procesamiento Prioritario) 3.4 3.8 4.7 2.3 1.8 2.3
4 I129F I-129F I-129F Petition for Alien Fiancé(e) Petición de Prometido(a) Extranjero(a) All Classifications Todas las Clasificaciones 3.6 6.5 5.2 4.6 8.0 12.1
5 I130 I-130 I-130 Petition for Alien Relative Petición de Familiar Extranjero Immediate Relative Familiar Inmediato 6.5 7.6 8.6 8.3 10.2 10.3
6 I131 I-131 I-131 Application for Travel Document Solicitud de Documento de Viaje Advance Parole Document Documento de Permiso Adelantado 3.0 3.6 4.5 4.6 7.7 7.3
7 I131 I-131 I-131 Application for Travel Document Solicitud de Documento de Viaje Parole in Place Permiso de Permanencia en el País 2.5 3.3 3.3 4.8 4.9 4.7
8 I131 I-131 I-131 Application for Travel Document Solicitud de Documento de Viaje Travel Document Documento de Viaje 4.2 2.9 2.8 4.0 7.2 10.6
9 I140 I-140 I-140 Immigrant Petition for Alien Workers Petición de Trabajador Inmigrante Extranjero Immigrant Petition (Premium filed) Petición de Inmigrante (con Procesamiento Prioritario) 0.4 0.3 0.3 0.3 0.4 0.3
10 I140 I-140 I-140 Immigrant Petition for Alien Workers Petición de Trabajador Inmigrante Extranjero Immigrant Petition (non Premium filed) Petición de Inmigrante (sin Procesamiento Prioritario) 7.3 8.9 5.8 4.9 8.2 9.3
...
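To trim the JSON down to the columns the page actually renders, you could select the fields the Vue template references. A sketch on a hand-built sample record (shaped after the printed output above; the real frame comes from the endpoint, and the field order is taken from the template's placeholders):

```python
import pandas as pd

# one sample record shaped like the JSON endpoint's output
records = [{
    "FORM_NUMBER": "I90", "FORM_NAME": "I-90",
    "FORM_TITLE_EN": "Application to Replace Permanent Resident Card",
    "FORM_TITLE_ES": "Solicitud para Reemplazar Tarjeta de Residente Permanente",
    "FORM_DESC_EN": "Initial issuance, replacement or renewal",
    "FORM_DESC_ES": "Emisión inicial, reemplazo o renovación",
    "FY14": 6.8, "FY15": 8.0, "FY16": 7.8, "FY17": 8.3,
    "NAT_AVG_MONTHS": 5.2, "FY22": 1.2,
}]
df = pd.DataFrame(records)

# the fields the <th>/<td> placeholders reference, in page order
display_cols = ["FORM_NAME", "FORM_TITLE_EN", "FORM_DESC_EN",
                "FY14", "FY15", "FY16", "FY17", "NAT_AVG_MONTHS", "FY22"]
df = df[display_cols]
print(df)
```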
These are the input string examples:
#example 1.1
colloquial_hour = "Hola nos vemos a las diez y veinte a m, ten en cuenta que al amanecer tendremos que estar despiertos, porque debemos estar alli a eso de nueve a m o las diez y cuarto a m"
#example 1.2
colloquial_hour = "A mi me parece entre las 10 15 am y las 11 a m, o a las 15 a m aunque quizas a medianoche este bien a eso de las 00:00 a m"
#example 1.3
colloquial_hour = "Puede que a las 10 am. Hay 10 a medias, a m mmm... creo que en 10 estarian para terminar a las 11:00 hs a m 11:59 a m"
#example 1.4
colloquial_hour = "Amediados a mediados del 30 antes de y dia; me parace que hay que estar en casa. Medianamente a, mediados de las 05 a m o cerca de 6 a m."
I have tried a simple replacement, but I think the cases must be restricted further with a regex pattern so that unwanted replacements are not made:
colloquial_hour = colloquial_hour.replace('a m', 'am ')
, and to be able to obtain the correct output for each of these examples:
#example 1.1
colloquial_hour = "Hola nos vemos a las diez y veinte am, ten en cuenta que al amanecer tendremos que estar despiertos, porque debemos estar alli a eso de nueve am o las diez y cuarto am"
#example 1.2
colloquial_hour = "A mi me parece entre las 10 15 am y las 11 am, o a las 15 am aunque quizas a medianoche este bien a eso de las 00:00 am"
#example 1.3
colloquial_hour = "Puede que a las 10 am. Hay 10 a medias, a m mmm... creo que en 10 estarian para terminar a las 11:00 hs am 11:59 am"
#example 1.4
colloquial_hour = "Amediados a mediados del 30 antes de y dia; me parace que hay que estar en casa. Medianamente a, mediados de las 05 am o cerca de 6 am."
In this case, the pseudo-pattern is: some number, then "a m" (to be replaced with the string "am"), then one or more spaces, a period, a comma, or directly the end of the string.
Cases should also be considered where there are incompletely written times in which "am" would be preceded by ":", " :", ": ", " hs", "hs", "hs ", " h.s. ", "h.s.", "h.s. ", " h.s", "h.s" or "h.s ". For example:
input_t = "a las 12: a m"
output = "a las 12: am"
input_t = "a las 12 : a m"
output = "a las 12 : am"
input_t = "a las 12 hs a m"
output = "a las 12 hs am"
input_t = "a las 12:hs a m"
output = "a las 12:hs am"
input_t = "a las 12: hs a m"
output = "a las 12: hs am"
input_t = "a las 12hsa m"
output = "a las 12hs am"
input_t = "a las 12h.sa m"
output = "a las 12h.s am"
input_t = "a las 12 h.sa m"
output = "a las 12 h.s am"
input_t = "a las 12 h.s.a m"
output = "a las 12 h.s. am"
For the first part I made this regex:
out = re.sub(r"([0-9][0-9]\W)a m(\W|\b)", r"\1am\2", colloquial_hour)
It changes "a m" to "am", keeping whatever was before and after.
For the "hs" or "h.s" I did this:
out = re.sub(r"(hs|h\.s)(\.)?\W*a m(\W|\b)", r"\1\2 am\3", out)
It searches for "hs" or "h.s" (with the dot escaped, so it doesn't match any character) before "a m". You can combine both regexes, as they are pretty similar, or apply them sequentially.
Let me know if there is any problem.
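Putting the two passes together, a sketch of a small helper (the name normalize_am is hypothetical) that applies both substitutions sequentially:

```python
import re

def normalize_am(text):
    # pass 1: two digits plus a non-word char before "a m"
    text = re.sub(r"([0-9][0-9]\W)a m(\W|\b)", r"\1am\2", text)
    # pass 2: "hs" / "h.s" variants (dot escaped) before "a m"
    text = re.sub(r"(hs|h\.s)(\.)?\W*a m(\W|\b)", r"\1\2 am\3", text)
    return text

print(normalize_am("a las 12 hs a m"))   # -> "a las 12 hs am"
print(normalize_am("a las 12h.s.a m"))   # -> "a las 12h.s. am"
```

Note that neither pass handles spelled-out numbers ("diez y veinte a m" from example 1.1); covering those would need an extra alternation over the number words.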
After you download your Facebook data, they provide JSON files with your post information. I read the JSON into a dataframe with pandas. Now I want to count the characters of every post I made. The posts are in df['data'], like: [{'post': 'Happy bday Raul'}].
I want the output to be the count of characters of: "Happy bday Raul" which will be 15 in this case or 7 in the case of "Morning" from [{'post': 'Morning'}].
df=pd.read_json('posts_1.json')
The columns are Date and Data with this format:
Date Data
01-01-2020 *[{'post': 'Morning'}]*
10-03-2020 *[{'post': 'Happy bday Raul'}]*
17-03-2020 *[{'post': 'This lockdown is sad'}]*
I tried to count the characters of this [{'post': 'Morning'}] by doing this
df['count']=df['data'].str.len()
But it's not working: the result is 1, because str.len() here measures the length of the one-element list, not of the text inside it.
I need to extract the value of the dictionary and do the len to count the characters. The output will be:
Date Data COUNT
01-01-2020 *[{'post': 'Morning'}]* 7
10-03-2020 *[{'post': 'Happy bday Raul'}]* 15
17-03-2020 *[{'post': 'This lockdown is sad'}]* 20
EDITED:
Used to_dict()
df11=df_post['data'].to_dict()
Output
{0: [{'post': 'Feliz cumpleaños Raul'}],
1: [{'post': 'Muchas felicidades Tere!!! Espero que todo vaya genial y siga aún mejor! Un beso desde la Escandinavia profunda'}],
2: [{'post': 'Hola!\nUna investigadora vendrá a finales de mayo, ¿Alguien tiene una habitación libre en su piso para ella? Many Thanks!'}],
3: [{'post': '¿Cómo va todo? Se que muchos estáis o estábais por Galicia :D\n\nOs recuerdo, el proceso de Matriculación tiene unos plazos concretos: desde el lunes 13 febrero hasta el viernes 24 de febrero.'}]
}
You can access the value of the post key for each row using list comprehension and count the length with str.len():
In one line of code, it would look like this:
df[1] = pd.Series([x['post'] for x in df[0]]).str.len()
This would also work, but I think it would be slower to execute:
df[1] = df[0].apply(lambda x: x['post']).str.len()
Full reproducible code below:
import pandas as pd

df = pd.DataFrame({0: [{'post': 'Feliz cumpleaños Raul'}],
1: [{'post': 'Muchas felicidades Tere!!! Espero que todo vaya genial y siga aún mejor! Un beso desde la Escandinavia profunda'}],
2: [{'post': 'Hola!\nUna investigadora vendrá a finales de mayo, ¿Alguien tiene una habitación libre en su piso para ella? Many Thanks!'}],
3: [{'post': '¿Cómo va todo? Se que muchos estáis o estábais por Galicia :D\n\nOs recuerdo, el proceso de Matriculación tiene unos plazos concretos: desde el lunes 13 febrero hasta el viernes 24 de febrero.'}]
})
df = df.T
df[1] = [x['post'] for x in df[0]]
df[2] = df[1].str.len()
df
Out[1]:
0 \
0 {'post': 'Feliz cumpleaños Raul'}
1 {'post': 'Muchas felicidades Tere!!! Espero qu...
2 {'post': 'Hola!
Una investigadora vendrá a fi...
3 {'post': '¿Cómo va todo? Se que muchos está...
1 2
0 Feliz cumpleaños Raul 22
1 Muchas felicidades Tere!!! Espero que todo vay... 112
2 Hola!\nUna investigadora vendrá a finales de ... 123
3 ¿Cómo va todo? Se que muchos estáis o está... 195
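Since the column in the original question holds one-element lists of dicts rather than bare dicts, the chained .str accessor also works directly on that shape, with no transpose or comprehension needed. A sketch assuming the Date/data layout shown in the question:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["01-01-2020", "10-03-2020", "17-03-2020"],
    "data": [[{"post": "Morning"}],
             [{"post": "Happy bday Raul"}],
             [{"post": "This lockdown is sad"}]],
})
# .str.get(0) takes the first element of each list,
# .str.get("post") the dict value, then .str.len() counts characters
df["count"] = df["data"].str.get(0).str.get("post").str.len()
print(df)
```

This yields counts of 7, 15 and 20 for the three posts, matching the expected COUNT column.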
I have a df which looks like this:
Label PG & PCF EE-LV Centre UMP FN
Label
Très favorable 2.68 3.95 4.20 3.33 3.05
Plutôt favorable 12.45 42.10 19.43 2.05 1.77
Plutôt défavorable 43.95 41.93 34.93 20.15 15.97
Très défavorable 37.28 9.11 41.44 70.26 75.99
Je ne sais pas 3.63 2.91 0.10 4.21 3.22
I would simply like to replace "&" with "and" wherever it is found, in either the column labels or the index labels.
I would've imagined that something like this would've worked, but it didn't:
dataframe.columns.replace("&","and")
Any ideas?
What you tried won't work, as Index has no such attribute; however, what we can do is convert the columns into a Series and then use str.replace to do what you want. You should be able to do the analogous operation on the index too:
df.columns = pd.Series(df.columns).str.replace('&', 'and')
df
Out[307]:
PG and PCF EE-LV Centre UMP
Label
Très favorable 2.68 3.95 4.20 3.33
Plutôt favorable 12.45 42.10 19.43 2.05
Plutôt défavorable 43.95 41.93 34.93 20.15
Très défavorable 37.28 9.11 41.44 70.26
Je ne sais pas 3.63 2.91 0.00 4.21
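In more recent pandas versions the Index itself exposes a .str accessor, so the intermediate Series is no longer needed. A sketch covering both the columns and the index (sample labels, not the full table above):

```python
import pandas as pd

df = pd.DataFrame({"PG & PCF": [2.68], "EE-LV": [3.95]},
                  index=["Très & favorable"])
# Index.str.replace works like Series.str.replace in recent pandas;
# regex=False makes it a plain substring replacement
df.columns = df.columns.str.replace("&", "and", regex=False)
df.index = df.index.str.replace("&", "and", regex=False)
print(df.columns.tolist())  # -> ['PG and PCF', 'EE-LV']
```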