Extract the URL from a CSV file - python

Here is a CSV file with different links to JPG pictures: https://drive.google.com/file/d/1rnsjn9D2mSrBULONpg1b1nw4ORu5aa8f/view?usp=sharing
Locally, I have done:
import pandas
path = "/home/infinity/Downloads/"
path_1 = path + "fonds-de-la-guerre-14-18-extrait-de-la-base-memoire.csv"
df = pandas.read_csv(path_1)
With that, I am not able to build a list of those links. How can I do that?

Here is what you can do. I assume you have already read your CSV, so df looks like this:
VIDEO-p Unnamed: 1
0 https://data.culture.gouv.fr/api/v2/catalog/da... NaN
1 Positif original NaN
2 Positif original NaN
3 https://data.culture.gouv.fr/api/v2/catalog/da... NaN
4 Front roumain Personnalité
... ... ...
18977 Tirage photographique NaN
18978 https://data.culture.gouv.fr/api/v2/catalog/da... NaN
18979 https://data.culture.gouv.fr/api/v2/catalog/da... NaN
18980 Tirage photographique NaN
18981 Tirage photographique NaN
[18982 rows x 2 columns]
Since you are looking for all the rows that contain an https:// address, you can simply do this:
Liste_des_adresses = df[df["VIDEO-p"].str.contains("https://", na=False)]
which gives
VIDEO-p Unnamed: 1
0 https://data.culture.gouv.fr/api/v2/catalog/da... NaN
3 https://data.culture.gouv.fr/api/v2/catalog/da... NaN
5 https://data.culture.gouv.fr/api/v2/catalog/da... NaN
6 https://data.culture.gouv.fr/api/v2/catalog/da... NaN
8 https://data.culture.gouv.fr/api/v2/catalog/da... NaN
... ... ...
18972 https://data.culture.gouv.fr/api/v2/catalog/da... NaN
18973 https://data.culture.gouv.fr/api/v2/catalog/da... NaN
18975 https://data.culture.gouv.fr/api/v2/catalog/da... NaN
18978 https://data.culture.gouv.fr/api/v2/catalog/da... NaN
18979 https://data.culture.gouv.fr/api/v2/catalog/da... NaN
[10333 rows x 2 columns]
Et voilà!
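The filter above returns a dataframe, while the question asks for a list. A minimal sketch of the last step, assuming the links live in the VIDEO-p column as shown; the small dataframe and the example.org URLs are made-up stand-ins for the real CSV:

```python
import pandas as pd

# Hypothetical stand-in for the real CSV: one text column mixing URLs, labels and gaps.
df = pd.DataFrame({
    "VIDEO-p": [
        "https://example.org/a.jpg",
        "Positif original",
        None,
        "https://example.org/b.jpg",
    ]
})

# Boolean mask: True where the cell contains "https://".
# na=False treats missing cells as "no match"; regex=False does a plain substring test.
mask = df["VIDEO-p"].str.contains("https://", na=False, regex=False)

# Keep only the matching cells of that column and turn them into a plain Python list.
urls = df.loc[mask, "VIDEO-p"].tolist()
print(urls)
```

Selecting the single column before calling .tolist() matters: calling it on the whole filtered dataframe is not possible, and .values.tolist() on it would give a list of rows rather than a list of links.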

Related

Adding data points to my dataframe is too slow, is there a faster way with Pandas?

I have the following problem. Let's say that I have a dataframe called file_data, which has 3 columns TS_ns, VALUE_NUMBER and Alias
VALUE_NUMBER Alias TS_ns
0 0.116000 Name_1 3
1 3.448000 Name_2 34
2 6.106000 Name_3 7
3 4.048000 Name_4 54
4 4.358000 Name_5 32
I would like to take its datapoints and add them to a new dataframe, called dataframe_var, which is empty and has only a column called Channel:
Channel
0 Name_1
1 Name_2
2 Name_3
3 Name_4
4 Name_5
In order to obtain this:
Channel 3 34 7 54 32
0 Name_1 0.116000 nan nan nan nan
1 Name_2 nan 3.448000 nan nan nan
2 Name_3 nan nan 6.106000 nan nan
3 Name_4 nan nan nan 4.048000 nan
4 Name_5 nan nan nan nan 4.358000
and possibly reorder the column names by increasing value.
The procedure I use is the following:
import pandas as pd
time_series = pd.Series( file_data.TS_ns )
value_series = pd.Series( file_data.VALUE_NUMBER )
alias_series = pd.Series( file_data.Alias )
for time_point, value_point, alias_point in zip( time_series, value_series, alias_series ):
    dataframe_var.loc[ dataframe_var.loc[ dataframe_var[ "Channel" ] == alias_point ].index[0], time_point ] = value_point
The problem is that this line:
dataframe_var.loc[ dataframe_var.loc[ dataframe_var[ "Channel" ] == alias_point ].index[0], time_point ] = value_point
is really slow: even with medium-sized (not so big) dataframes I have to wait hours until it completes.
Do you know if there is a more efficient and faster way to add data to my dataframe? Thanks.
You can achieve this using .pivot():
file_data.pivot(index='Alias', columns='TS_ns', values='VALUE_NUMBER')
TS_ns 3 7 32 34 54
Alias
Name_1 0.116 NaN NaN NaN NaN
Name_2 NaN NaN NaN 3.448 NaN
Name_3 NaN 6.106 NaN NaN NaN
Name_4 NaN NaN NaN NaN 4.048
Name_5 NaN NaN 4.358 NaN NaN
There is no need for a for loop: assigning cell by cell is generally very inefficient with pandas dataframes, and there is almost always a vectorized function that is much faster. If dataframe_var is used to filter file_data in this process, you can merge the above output onto it to keep only the desired aliases. Because dataframe_var holds the names in its Channel column while the pivoted frame has Alias as its index, join the column against the index:
dataframe_var.merge(file_data.pivot(index='Alias', columns='TS_ns',
                                    values='VALUE_NUMBER'),
                    left_on='Channel', right_index=True, how='left')
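To make the pivot-then-merge idea concrete, here is a small self-contained sketch built from the sample rows in the question. The two-row dataframe_var is an assumption chosen to show the filtering effect, and the names wide and result are illustrative only:

```python
import pandas as pd

# Sample data copied from the question.
file_data = pd.DataFrame({
    "VALUE_NUMBER": [0.116, 3.448, 6.106, 4.048, 4.358],
    "Alias": ["Name_1", "Name_2", "Name_3", "Name_4", "Name_5"],
    "TS_ns": [3, 34, 7, 54, 32],
})

# Assumed: dataframe_var lists only the channels we want to keep.
dataframe_var = pd.DataFrame({"Channel": ["Name_1", "Name_3"]})

# One row per Alias, one column per timestamp; column labels come out sorted.
wide = file_data.pivot(index="Alias", columns="TS_ns", values="VALUE_NUMBER")

# Match the Channel column against the Alias index of the pivoted frame.
result = dataframe_var.merge(wide, left_on="Channel", right_index=True, how="left")
print(result)
```

Note that .pivot() raises a ValueError when the same Alias/TS_ns pair occurs more than once; in that case pivot_table with an explicit aggfunc is the usual alternative.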

tabula.read_pdf in python, getting a list variable and can't read it

I am using tabula to extract some data from a PDF. When I read the file, it outputs a list, not a dataframe, and I'm having problems reading the values:
import tabula
file = "example.pdf"
path = 'data/' + file
df = tabula.read_pdf(path, pages = '1', multiple_tables = False)
cliente_raw = tabula.read_pdf(path, pages=1,output_format="dataframe")
print(cliente_raw)
This is the output
[ Beneficiario: Nury García Unnamed: 1 NIT/Cédula:
0 Dirección: Calle 115 #53-74 Apto 307 NaN Ciudad:
1 Referencia Descripción NaN
2 Spectral + Porcelai Perfect Face Kit, -/- NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
39564525 Teléfono: 601 6299329 Unnamed: 5 Unnamed: 6
0 BOGOTA (C/MARCA) País: COLOMBIA NaN NaN
1 Cantidad IVA Valor Unitario NaN Valor Total
2 1 19% 125,210 NaN 125,210
3 NaN Subtotal NaN 125,210
4 NaN IVA NaN 23,790
5 NaN TOTAL NaN 149,000 ]
The length of this variable is 1, so I don't know how to extract the values. Any help?
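A possible way forward, sketched under assumptions: tabula-py's read_pdf returns a list with one dataframe per table it detects, so a length of 1 means one table was found and it sits at index 0. The snippet below uses a hand-built list named tables as a stand-in for the real return value (no PDF is read here), and its column names are illustrative only:

```python
import pandas as pd

# Stand-in for what tabula.read_pdf(path, pages=1) returns:
# a list holding one DataFrame per detected table.
tables = [
    pd.DataFrame({
        "Referencia": ["Spectral + Porcelai"],
        "Cantidad": [1],
    })
]

# The list has a single element, so the table itself is tables[0].
cliente = tables[0]

# From here on it is an ordinary DataFrame and can be indexed as usual.
print(len(tables), type(cliente).__name__)
print(cliente.loc[0, "Referencia"])
```

With the real file, the same indexing would apply to cliente_raw, i.e. cliente_raw[0]. The cells will likely still need cleaning, since the printed output suggests labels and values ended up in the same columns.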

pandas.read_html tables not found

I'm trying to get a list of the major world indices in Yahoo Finance at this URL: https://finance.yahoo.com/world-indices.
I tried first to get the indices in a table by just running
major_indices=pd.read_html("https://finance.yahoo.com/world-indices")[0]
In this case the error was:
ValueError: No tables found
So I read a solution using Selenium in the question "pandas read_html - no tables found".
The solution they came up with (with some adjustments) is:
from selenium import webdriver
import pandas as pd
from selenium.webdriver.common.keys import Keys
from webdrivermanager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().download_and_install())
driver.get("https://finance.yahoo.com/world-indices")
html = driver.page_source
tables = pd.read_html(html)
data = tables[1]
This code gave me the same error again:
ValueError: No tables found
I don't know whether to keep using Selenium or whether pd.read_html alone is enough. Either way, I'm trying to get this data and don't know how to proceed. Can anyone help me?
You don't need Selenium here, you just have to set the euConsentId cookie:
import pandas as pd
import requests
import uuid
url = 'https://finance.yahoo.com/world-indices'
cookies = {'euConsentId': str(uuid.uuid4())}
html = requests.get(url, cookies=cookies).content
df = pd.read_html(html)[0]
Output:
>>> df
Symbol Name Last Price Change % Change Volume Intraday High/Low 52 Week Range Day Chart
0 ^GSPC S&P 500 4023.89 93.81 +2.39% 2.545B NaN NaN NaN
1 ^DJI Dow 30 32196.66 466.36 +1.47% 388.524M NaN NaN NaN
2 ^IXIC Nasdaq 11805.00 434.04 +3.82% 5.15B NaN NaN NaN
3 ^NYA NYSE COMPOSITE (DJ) 15257.36 326.26 +2.19% 0 NaN NaN NaN
4 ^XAX NYSE AMEX COMPOSITE INDEX 4025.81 122.66 +3.14% 0 NaN NaN NaN
5 ^BUK100P Cboe UK 100 739.68 17.83 +2.47% 0 NaN NaN NaN
6 ^RUT Russell 2000 1792.67 53.28 +3.06% 0 NaN NaN NaN
7 ^VIX CBOE Volatility Index 28.87 -2.90 -9.13% 0 NaN NaN NaN
8 ^FTSE FTSE 100 7418.15 184.81 +2.55% 0 NaN NaN NaN
9 ^GDAXI DAX PERFORMANCE-INDEX 14027.93 288.29 +2.10% 0 NaN NaN NaN
10 ^FCHI CAC 40 6362.68 156.42 +2.52% 0 NaN NaN NaN
11 ^STOXX50E ESTX 50 PR.EUR 3703.42 89.99 +2.49% 0 NaN NaN NaN
12 ^N100 Euronext 100 Index 1211.74 28.89 +2.44% 0 NaN NaN NaN
13 ^BFX BEL 20 3944.56 14.35 +0.37% 0 NaN NaN NaN
14 IMOEX.ME MOEX Russia Index 2307.50 9.61 +0.42% 0 NaN NaN NaN
15 ^N225 Nikkei 225 26427.65 678.93 +2.64% 0 NaN NaN NaN
16 ^HSI HANG SENG INDEX 19898.77 518.43 +2.68% 0 NaN NaN NaN
17 000001.SS SSE Composite Index 3084.28 29.29 +0.96% 3.109B NaN NaN NaN
18 399001.SZ Shenzhen Component 11159.79 64.92 +0.59% 3.16B NaN NaN NaN
19 ^STI STI Index 3191.16 25.98 +0.82% 0 NaN NaN NaN
20 ^AXJO S&P/ASX 200 7075.10 134.10 +1.93% 0 NaN NaN NaN
21 ^AORD ALL ORDINARIES 7307.70 141.10 +1.97% 0 NaN NaN NaN
22 ^BSESN S&P BSE SENSEX 52793.62 -136.69 -0.26% 0 NaN NaN NaN
23 ^JKSE Jakarta Composite Index 6597.99 -1.85 -0.03% 0 NaN NaN NaN
24 ^KLSE FTSE Bursa Malaysia KLCI 1544.41 5.61 +0.36% 0 NaN NaN NaN
25 ^NZ50 S&P/NZX 50 INDEX GROSS 11168.18 -9.18 -0.08% 0 NaN NaN NaN
26 ^KS11 KOSPI Composite Index 2604.24 54.16 +2.12% 788539 NaN NaN NaN
27 ^TWII TSEC weighted index 15832.54 215.86 +1.38% 0 NaN NaN NaN
28 ^GSPTSE S&P/TSX Composite index 20099.81 400.76 +2.03% 294.637M NaN NaN NaN
29 ^BVSP IBOVESPA 106924.18 1236.54 +1.17% 0 NaN NaN NaN
30 ^MXX IPC MEXICO 49579.90 270.58 +0.55% 212.868M NaN NaN NaN
31 ^IPSA S&P/CLX IPSA 5058.88 0.00 0.00% 0 NaN NaN NaN
32 ^MERV MERVAL 38390.84 233.89 +0.61% 0 NaN NaN NaN
33 ^TA125.TA TA-125 1964.95 23.38 +1.20% 0 NaN NaN NaN
34 ^CASE30 EGX 30 Price Return Index 10642.40 -213.50 -1.97% 36.837M NaN NaN NaN
35 ^JN0U.JO Top 40 USD Net TRI Index 4118.19 65.63 +1.62% 0 NaN NaN NaN

Pandas trying to make values within a column into new columns after groupby on column

My original dataframe looked like:
timestamp variables value
1 2017-05-26 19:46:41.289 inf 0.000000
2 2017-05-26 20:40:41.243 tubavg 225.489639
... ... ... ...
899541 2017-05-02 20:54:41.574 caspre 684.486450
899542 2017-04-29 11:17:25.126 tvol 50.895000
Now I want to bucket this dataset by time, which can be done with the code:
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.groupby(pd.Grouper(key='timestamp', freq='5min'))
But I also want all the different metrics to become columns in the new dataframe. For example the first two rows from the original dataframe would look like:
timestamp inf tubavg caspre tvol ...
1 2017-05-26 19:46:41.289 0.000000 225.489639 xxxxxxx xxxxx
... ... ... ...
xxxxx 2017-05-02 20:54:41.574 xxxxxx xxxxxx 684.486450 50.895000
As can be seen, the time has been bucketed into 5-minute intervals, and every distinct value in variables should become its own column, filled in for each bucket. Each bucket is labelled with the first timestamp that fell into it.
To solve this, I have tried a couple of different solutions, but I keep running into errors.
Try unstacking the variables column from rows to columns with .unstack(1). The parameter is 1 because we want the second index level (0 would be the first).
Then drop the outer level of the column MultiIndex that this creates with .droplevel(), to make it a little cleaner.
Finally, use pd.Grouper. Since the date/time is in the index, you don't need to specify a key.
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.set_index(['timestamp','variables']).unstack(1)
df.columns = df.columns.droplevel()
df = df.groupby(pd.Grouper(freq='5min')).mean().reset_index()
df
Out[1]:
variables timestamp caspre inf tubavg tvol
0 2017-04-29 11:15:00 NaN NaN NaN 50.895
1 2017-04-29 11:20:00 NaN NaN NaN NaN
2 2017-04-29 11:25:00 NaN NaN NaN NaN
3 2017-04-29 11:30:00 NaN NaN NaN NaN
4 2017-04-29 11:35:00 NaN NaN NaN NaN
... ... ... ... ...
7885 2017-05-26 20:20:00 NaN NaN NaN NaN
7886 2017-05-26 20:25:00 NaN NaN NaN NaN
7887 2017-05-26 20:30:00 NaN NaN NaN NaN
7888 2017-05-26 20:35:00 NaN NaN NaN NaN
7889 2017-05-26 20:40:00 NaN NaN 225.489639 NaN
Another way would be to .groupby the variables as well and then .unstack(1) again:
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.groupby([pd.Grouper(freq='5min', key='timestamp'), 'variables']).mean().unstack(1)
df.columns = df.columns.droplevel()
df = df.reset_index()
df
Out[1]:
variables timestamp caspre inf tubavg tvol
0 2017-04-29 11:15:00 NaN NaN NaN 50.895
1 2017-05-02 20:50:00 684.48645 NaN NaN NaN
2 2017-05-26 19:45:00 NaN 0.0 NaN NaN
3 2017-05-26 20:40:00 NaN NaN 225.489639 NaN
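The second approach can be tried on a tiny made-up frame. This is a sketch under assumptions: the four rows below are invented (only the column names come from the question), and the mean is taken when a variable has several readings in one bucket:

```python
import pandas as pd

# Invented sample in the question's long format: one row per (timestamp, variable).
df = pd.DataFrame({
    "timestamp": [
        "2017-05-26 19:46:41.289",
        "2017-05-26 19:48:10.000",
        "2017-05-26 19:49:59.000",
        "2017-05-26 20:40:41.243",
    ],
    "variables": ["inf", "tubavg", "inf", "tubavg"],
    "value": [0.0, 225.0, 2.0, 227.0],
})
df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")

wide = (
    # Group by 5-minute bucket and by variable name at the same time.
    df.groupby([pd.Grouper(key="timestamp", freq="5min"), "variables"])["value"]
      .mean()                 # average readings that share a bucket and a variable
      .unstack("variables")   # move the variable names from rows to columns
      .reset_index()
)
wide.columns.name = None      # drop the leftover "variables" label on the columns
print(wide)
```

Selecting the value column before .mean() yields a Series, so no .droplevel() is needed afterwards. Unlike the first approach, this one tolerates duplicate timestamp/variable pairs, which would make set_index(...).unstack(1) fail with a duplicate-entries error.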

Select a (non-indexed) column based on text content of a cell in a python/pandas dataframe

TL:DR - how do I create a dataframe/series from one or more columns in an existing non-indexed dataframe based on the column(s) containing a specific piece of text?
I am relatively new to Python and data analysis. This is my first time posting a question on Stack Overflow, but I've been hunting for an answer for a long time (and I used to code regularly) without any success.
I have a dataframe imported from an Excel file that doesn't have named/indexed columns. I am trying to extract data from nearly 2000 of these files, which all have slightly different columns of data (of course - why make it simple... or follow a template... or simply use something other than poorly formatted Excel spreadsheets...).
The original dataframe (from a poorly structured XLS file) looks a bit like this:
0 NaN RIGHT NaN
1 Date UCVA Sph
2 2007-01-13 00:00:00 6/38 [-2.00]
3 2009-11-05 00:00:00 6/9 NaN
4 2009-11-18 00:00:00 6/12 NaN
5 2009-12-14 00:00:00 6/9 [-1.25]
6 2018-04-24 00:00:00 worn CL [-5.50]
3 4 5 6 7 8 9 \
0 NaN NaN NaN NaN NaN NaN NaN
1 Cyl Axis BSCVA Pentacam remarks K1 K2 K2 back
2 [-2.75] 65 6/9 NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN 6/5 Pentacam 46 43.9 -6.6
5 [-5.75] 60 6/6-1 NaN NaN NaN NaN
6 [+7.00} 170 6/7.5 NaN NaN NaN NaN
... 17 18 19 20 21 22 \
0 ... NaN NaN NaN NaN NaN NaN
1 ... BSCVA Pentacam remarks K1 K2 K2 back K max
2 ... 6/5 NaN NaN NaN NaN NaN
3 ... NaN NaN NaN NaN NaN NaN
4 ... NaN Pentacam 44.3 43.7 -6.2 45.5
5 ... 6/4-4 NaN NaN NaN NaN NaN
6 ... 6/5 NaN NaN NaN NaN NaN
I want to extract a set of dataframes/series that I can then combine back together to get a 'tidy' dataframe e.g.:
1 Date R-UCVA R-Sph
2 2007-01-13 00:00:00 6/38 [-2.00]
3 2009-11-05 00:00:00 6/9 NaN
4 2009-11-18 00:00:00 6/12 NaN
5 2009-12-14 00:00:00 6/9 [-1.25]
6 2018-04-24 00:00:00 worn CL [-5.50]
1 R-Cyl R-Axis R-BSCVA R-Penta R-K1 R-K2 R-K2 back
2 [-2.75] 65 6/9 NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN 6/5 Pentacam 46 43.9 -6.6
5 [-5.75] 60 6/6-1 NaN NaN NaN NaN
6 [+7.00} 170 6/7.5 NaN NaN NaN NaN
and so on. I'm trying to write some code that will pull out the columns I define by looking for the words "Date", "UCVA", etc. Then I plan to stitch them back together into a single dataframe with the patient identifier as an extra column, and then cycle through all the XLS files, appending the whole lot to a single CSV file that I can do useful things with (like put it into an Access database - yes, I know, but it has to be easy to use and already installed on an NHS computer - and run statistical analysis).
Any suggestions? I hope that's enough information.
Thanks very much in advance.
Kind regards
Vicky
Here is something that will hopefully get you started.
I have prepared a text.xlsx file, which I can read as follows:
import pandas as pd
path = 'text.xlsx'
df = pd.read_excel(path, header=[0,1])
# Deal with two levels of headers, here I just join them together crudely
df.columns = df.columns.map(lambda h: ' '.join(h))
# Slight hack because I messed with the column names
# I create two dataframes, one with the first column, one with the second column
df1 = df[[df.columns[0],df.columns[1]]]
df2 = df[[df.columns[0], df.columns[2]]]
# Stacking them on top of each other
result = pd.concat([df1, df2])
print(result)
#Merging them on the Date column
result = pd.merge(left=df1, right=df2, on=df1.columns[0])
print(result)
This gives the output
RIGHT Sph RIGHT UCVA Unnamed: 0_level_0 Date
0 NaN 6/38 2007-01-13 00:00:00
1 NaN 6/37 2009-11-05 00:00:00
2 NaN 9/56 2009-11-18 00:00:00
0 [-2.00] NaN 2007-01-13 00:00:00
1 NaN NaN 2009-11-05 00:00:00
2 NaN NaN 2009-11-18 00:00:00
and
Unnamed: 0_level_0 Date RIGHT UCVA RIGHT Sph
0 2007-01-13 00:00:00 6/38 [-2.00]
1 2009-11-05 00:00:00 6/37 NaN
2 2009-11-18 00:00:00 9/56 NaN
Some pointers:
How to merge two header rows? See this question and answer.
How to select pandas columns conditionally? See e.g. this or this
How to merge dataframes? There is a very good guide in the pandas doc
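For the specific request of picking columns by the text found in a cell, one option is to treat the label row as a Series and look the wanted words up in it. A minimal sketch, assuming the sheet was read without headers so that the labels sit in row 1; the small raw frame and the names wanted, cols and tidy are illustrative only:

```python
import pandas as pd

# Hypothetical miniature of the sheet as read with header=None:
# row 0 holds the eye (RIGHT), row 1 the column labels, data starts at row 2.
raw = pd.DataFrame([
    [None, "RIGHT", None, None],
    ["Date", "UCVA", "Sph", "Cyl"],
    ["2007-01-13", "6/38", "[-2.00]", "[-2.75]"],
    ["2009-11-05", "6/9", None, None],
])

header = raw.iloc[1]                  # the row that carries the labels
wanted = ["Date", "UCVA"]             # labels to search for

# Column positions whose label is one of the wanted words.
cols = header[header.isin(wanted)].index

# Take the data rows for those columns and name them after their labels.
tidy = raw.loc[2:, cols].copy()
tidy.columns = header.loc[cols].tolist()
tidy = tidy.reset_index(drop=True)
print(tidy)
```

In the real files a label such as BSCVA appears once per eye, so the lookup would return several positions; prefixing each label with the eye found in row 0 (for example R-UCVA) before selecting is one way to keep them apart.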
