pandas data mining from Eurostat - python

I'm starting work on analysing data from statistics institutions like Eurostat using Python, and therefore pandas. I found out there are two methods to get data from Eurostat.
pandas_datareader: it seems very easy to use, but I ran into problems getting some specific data
pandasdmx: I found it a bit complicated, but it seems a promising solution; its documentation, however, is poor
I use a free Azure notebook, an online service, but I don't think it complicates my situation any further.
Let me explain the problems with pandas_datareader. According to the pandas documentation, in the API section, there is this briefly documented package, and it works. Apart from the example shown, which works nicely, problems arise with other tables. For example, I can get data about European house prices, whose table ID is prc_hpi_a, with this simple code:
import pandas_datareader.data as web
import datetime
df = web.DataReader('prc_hpi_a', 'eurostat')
But the table has three types of data about dwellings: TOTAL, EXISTING and NEW. I only get EXISTING dwellings and I don't know how to get the other ones. Do you have a solution for this type of filtering?
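For reference, a minimal inspection sketch of what I am looking at (it only assumes the call above returns a DataFrame whose columns carry the dataset dimensions, as in the documented example; no specific level names are assumed):
#df comes from the DataReader call above
print(df.columns)        #inspect the column index returned by the reader
print(df.columns.names)  #and the dimension names it carries, if any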
Secondly, there is the path using pandasdmx. Here it is more complicated. My idea is to load all the data into a pandas DataFrame, and then I can analyse it as I want. Easy to say, but I haven't found many tutorials that explain this step, i.e. loading data into pandas structures. For example, I found this tutorial, but I'm stuck at the first step, that is, instantiating a client:
import pandasdmx
from pandasdmx import client
#estat=client('Eurostat', 'milk.db')
and it returns:
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input> in <module>()
      1 import pandasdmx
----> 2 from pandasdmx import client
      3 estat=client('Eurostat', 'milk.db')
ImportError: cannot import name 'client'
What's the problem here? I've looked around but found no answer to this problem.
I also followed this tutorial:
from pandasdmx import Request
estat = Request('ESTAT')
metadata = estat.datastructure('DSD_une_rt_a').write()
metadata.codelist.iloc[8:18]
resp = estat.data('une_rt_a', key={'GEO': 'EL+ES+IE'}, params={'startPeriod': '2007'})
data = resp.write(s for s in resp.data.series if s.key.AGE == 'TOTAL')
data.columns.names
data.columns.levels
data.loc[:, ('PC_ACT', 'TOTAL', 'T')]
I got the data, but my purpose is to load them into a pandas structure (Series, DataFrame, etc.), so I can handle them easily according to my work. How do I do that?
Actually I did it with this working line (placed below the previous ones):
s=pd.DataFrame(data)
But it doesn't work if I try to get other data tables. Let me explain with another example, about the Harmonised Index of Consumer Prices monthly index table:
estat = Request('ESTAT')
metadata = estat.datastructure('DSD_prc_hicp_midx').write()
resp = estat.data('prc_hicp_midx')
data = resp.write(s for s in resp.data.series if s.key.COICOP == 'CP00')
It returns an error here, that is:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>()
      2 metadata = estat.datastructure('DSD_prc_hicp_midx').write()
      3 resp = estat.data('prc_hicp_midx')
----> 4 data = resp.write(s for s in resp.data.series if s.key.COICOP == 'CP00')
      5 #metadata.codelist
      6 #data.loc[:, ('TOTAL', 'INX_Q','EA', 'Q')]
~/anaconda3_501/lib/python3.6/site-packages/pandasdmx/api.py in __getattr__(self, name)
    622         Make Message attributes directly readable from Response instance
    623         '''
--> 624         return getattr(self.msg, name)
    625
    626     def _init_writer(self, writer):
AttributeError: 'DataMessage' object has no attribute 'data'
Why does it not get the data now? What's wrong?
I lost almost a day looking around for some clear examples and explanations. Do you have any to propose? Is there full and clear documentation? I also found this page with other examples, explaining the use of categorical schemes, but it is not for Eurostat (as explained at some point).
Both methods could work, apart from the issues explained, but I also need a suggestion for a definitive method to use, to query not only Eurostat but also many other institutions like the OECD, the World Bank, etc.
Could you guide me to a definitive and working solution, even if it is different for each institution?

This is my definitive answer to my own question, and it works for each type of data collected from Eurostat. I post it here because it can be useful to many.
Let me propose some examples. They produce three pandas Series (EU_unempl, EU_GDP, EU_intRates) with data and correct time indexes.
import pandas as pd
import numpy
#----Unemployment Rate---------
dataEU_unempl=pd.read_json('http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/ei_lmhr_m?geo=EA&indic=LM-UN-T-TOT&s_adj=NSA&unit=PC_ACT',typ='series',orient='table',numpy=True) #,typ='DataFrame',orient='table'
x=[]
for i in range((sorted(int(v) for v in dataEU_unempl['value'].keys())[0]),1+(sorted((int(v) for v in dataEU_unempl['value'].keys()),reverse=True))[0]):
    x=numpy.append(x,dataEU_unempl['value'][str(i)])
#note: '%YM%m' parses monthly labels like '1993M01' (month); '%M' would be minutes
EU_unempl=pd.Series(x,index=pd.date_range((pd.to_datetime((sorted(dataEU_unempl['dimension']['time']['category']['index'].keys())[(sorted(int(v) for v in dataEU_unempl['value'].keys())[0])]),format='%YM%m')), periods=len(x), freq='M')) #'1/1993'
#----GDP---------
dataEU_GDP=pd.read_json('http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/namq_10_gdp?geo=EA&na_item=B1GQ&s_adj=NSA&unit=CP_MEUR',typ='series',orient='table',numpy=True) #,typ='DataFrame',orient='table'
x=[]
for i in range((sorted(int(v) for v in dataEU_GDP['value'].keys())[0]),1+(sorted((int(v) for v in dataEU_GDP['value'].keys()),reverse=True))[0]):
    x=numpy.append(x,dataEU_GDP['value'][str(i)])
EU_GDP=pd.Series(x,index=pd.date_range((pd.Timestamp(sorted(dataEU_GDP['dimension']['time']['category']['index'].keys())[(sorted(int(v) for v in dataEU_GDP['value'].keys())[0])])), periods=len(x), freq='Q'))
#----Money market interest rates---------
dataEU_intRates=pd.read_json('http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/irt_st_m?geo=EA&intrt=MAT_ON',typ='series',orient='table',numpy=True) #,typ='DataFrame',orient='table'
x=[]
for i in range((sorted(int(v) for v in dataEU_intRates['value'].keys())[0]),1+(sorted((int(v) for v in dataEU_intRates['value'].keys()),reverse=True))[0]):
    x=numpy.append(x,dataEU_intRates['value'][str(i)])
EU_intRates=pd.Series(x,index=pd.date_range((pd.to_datetime((sorted(dataEU_intRates['dimension']['time']['category']['index'].keys())[(sorted(int(v) for v in dataEU_intRates['value'].keys())[0])]),format='%YM%m')), periods=len(x), freq='M'))
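A short usage sketch with the three resulting Series (purely illustrative; it only assumes the code above has run):
#align the monthly series to quarters and put everything side by side
quarterly = pd.concat({'unemployment': EU_unempl.resample('Q').mean(),
                       'gdp': EU_GDP,
                       'interest_rate': EU_intRates.resample('Q').mean()}, axis=1)
print(quarterly.tail())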

The general solution is not to rely on overly specific APIs like pandas_datareader and instead go to the source. You can use pandas_datareader's source code as inspiration and as a guide for how to do it. But ultimately, when you need to get data from a source, you may want to access that source directly and load the data yourself.
One very popular tool for HTTP APIs is requests. You can easily use it to load JSON data from any website or HTTP(S) service. Once you have the JSON, you can load it into pandas. Because this solution is based on general-purpose building blocks, it is applicable to virtually any data source on the web (as opposed to e.g. pandaSDMX, which is only applicable to SDMX data sources).
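A minimal sketch of this approach, reusing the unemployment-rate URL from the answer above. The JSON layout (a 'value' dict keyed by observation position and the time labels under 'dimension') follows the Eurostat JSON web service used there; the position-to-label mapping works here only because every dimension except time is pinned to a single value in the URL:
import requests
import pandas as pd

url = ('http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/'
       'ei_lmhr_m?geo=EA&indic=LM-UN-T-TOT&s_adj=NSA&unit=PC_ACT')
payload = requests.get(url).json()
#map observation positions back to their time labels, then build a Series
time_index = payload['dimension']['time']['category']['index']  #label -> position
pos_to_label = {pos: label for label, pos in time_index.items()}
values = {pos_to_label[int(pos)]: val for pos, val in payload['value'].items()}
unemployment = pd.Series(values).sort_index()
print(unemployment.head())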

Load with read_csv and multiple separators
The problem with Eurostat data from the bulk download repository is that it comes as tab-separated files where the first 3 columns are separated by commas. pandas read_csv() can deal with multiple separators given as a regex if you specify engine="python". This works for some data sets, but the OP's dataset also contains flags in the last column, which cannot be ignored.
# Load the house price index from the Eurostat bulk download facility
import pandas
code = "prc_hpi_a"
url = f"https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2F{code}.tsv.gz" # Pandas.read_csv could almost read it directly with a multiple separator
df = pandas.read_csv(url, sep=",|\t| [^ ]?\t", na_values=":", engine="python")
# But the last column is a character column instead of a numeric because of the
# presence of a flag ": c" illustrated in the last line of the table extract
# below
# purchase,unit,geo\time\t 2006\t 2005
# DW_EXST,I10_A_AVG,AT\t :\t :
# DW_EXST,I10_A_AVG,BE\t 83.86\t 75.16
# DW_EXST,I10_A_AVG,BG\t 87.81\t 76.56
# DW_EXST,I10_A_AVG,CY\t :\t :
# DW_EXST,I10_A_AVG,CZ\t :\t :
# DW_EXST,I10_A_AVG,DE\t100.80\t101.10
# DW_EXST,I10_A_AVG,DK\t113.85\t 91.79
# DW_EXST,I10_A_AVG,EE\t156.23\t 98.69
# DW_EXST,I10_A_AVG,ES\t109.68\t :
# DW_EXST,I10_A_AVG,FI\t : c\t : c
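A hedged follow-up sketch for the flag problem: strip the flag letters from the value columns and convert them to numeric. Selecting the value columns by position is an assumption based on the extract above, where the first 3 columns are identifiers:
value_cols = df.columns[3:]
df[value_cols] = df[value_cols].apply(
    lambda col: pandas.to_numeric(
        col.astype(str).str.replace(r"[^0-9.\-]", "", regex=True),
        errors="coerce"))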
Load with the eurostat package
There is also a Python package called eurostat which makes it possible to search for and load data sets from the bulk facility into pandas data frames.
Load a data set from the bulk download facility into a data frame (here reusing the code defined above):
import eurostat
df1 = eurostat.get_data_df(code)
The table of contents of the bulk download facility can be read with
toc_url = "https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=table_of_contents_en.txt"
toc2 = pandas.read_csv(toc_url, sep="\t")
# Remove white spaces at the beginning and end of strings
toc2 = toc2.applymap(lambda x: x.strip() if isinstance(x, str) else x)
or with
toc = eurostat.get_toc_df()
toc0 = (eurostat.subset_toc_df(toc, "exchange"))
The last line searches for the data sets that have "exchange" in their title.
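A hedged follow-up sketch: pick a code from that search result and load it (the "code" column name in the table of contents is an assumption):
found = eurostat.subset_toc_df(toc, "exchange")
first_code = found["code"].iloc[0]  #"code" column name is an assumption
df_exchange = eurostat.get_data_df(first_code)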
Reshape to long format
It might be useful to reshape the Eurostat data to long format with:
import re
if any(df.columns.str.contains("time")):
    time_column = df.columns[df.columns.str.contains("time")][-1]
    # Id columns are before the time column
    id_columns = df.loc[:, :time_column].columns
    df = df.melt(id_vars=id_columns, var_name="period", value_name="value")
    # Remove "\time" from the rightmost column of the index
    df = df.rename(columns=lambda x: re.sub(r"\\time", "", x))
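In long format, the filtering from the original question becomes straightforward. A hedged example: DW_EXST appears in the table extract above, but the exact codes for new and total dwellings are not shown there, so inspect the column first:
print(df["purchase"].unique())              #list all dwelling-type codes present
existing = df[df["purchase"] == "DW_EXST"]  #existing dwellings only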

Related

How do you get a detailed description text on a given patent?

I am looking at the PatentsView API and it is unclear how to retrieve the full text of a patent. It contains only detail_desc_length, not the actual detailed description.
I would like to perform the following on both the patent_abstract and the "detailed_description".
import httpx
url = 'https://api.patentsview.org/patents/query?q={"_and": [{"_gte":{"patent_date":"2001-01-01"}},{"_text_any":{"patent_abstract":"radar aircraft"}},{"_neq":{"assignee_lastknown_country":"US"}}]}&o:{"per_page": 1000}'
r=httpx.get(url)
r.json()
You should take a look at patent_client! It's a Python module that searches the live USPTO and EPO databases using a Django-style API. The results from any query can then be cast into pandas DataFrames or Series with a simple .to_pandas() call.
from patent_client import Patent
result = Patent.objects.filter(issue_date__gt="2001-01-01", abstract="radar aircraft")
# That provides an iterator of Patent objects that match the query.
# You can grab abstracts and detailed descriptions like this:
for patent in result:
    patent.abstract
    patent.description
# or you can just load it up in a Pandas dataframe:
result.values("publication_number", "abstract", "description").to_pandas()
# Produces a Pandas dataframe with columns for the patent number, abstract, and description.
A great place to start is the User Guide Introduction
PyPI | GitHub | Docs
(Full disclosure - I'm the author and maintainer of patent_client)

Creating a simple webpage with Python, where template content is populated from a database (or a pandas dataframe) based on query

I use Python mainly for data analysis, so I'm pretty used to pandas. But apart from basic HTML, I have little experience with web development.
For work I want to make a very simple webpage that, based on the address/query string, populates a template page with info from an SQL database (even if it has to go through a dataframe or CSV first, that's fine for now). I've done searches, but I just don't know the keywords to ask for (hence sorry if this is a duplicate or the title isn't as clear as it could be).
What I'm imagining (simplest example, excuse my lack of knowledge here!). Example dataframe:
import pandas as pd
df = pd.DataFrame(index=[1,2,3], columns=["Header","Body"], data=[["a","b"],["c","d"],["e","f"]])
Out[1]:
Header Body
1 a b
2 c d
3 e f
User puts in page, referencing the index 2:
"example.com/database.html?id=2" # Or whatever the syntax is.
Output page: (since id=2, it takes the row data from index 2, so "c" and "d")
<html><body>
Header<br>
c<p>
Body<br>
d<p>
</body></html>
It should be pretty simple, right? But where do I start? Which Python library? I hear about Django and Flask, but are they overkill for this? Is there an example I could follow? And lastly, how does the syntax work for the webpage address?
Cheers!
PS: I realise I should probably just query the SQL database directly and cut out the pandas middle-man, just I'm more familiar with pandas hence the example above.
Edit: I a word.
You can start with Flask. It is easy to set up and there are lots of good resources online.
Start with this minimal web app: http://flask.pocoo.org/docs/1.0/quickstart/
Example snippet
from flask import Flask, request
import pandas as pd

app = Flask(__name__)

@app.route('/database')
def database():
    id = request.args.get('id')  #if key doesn't exist, returns None
    df = pd.DataFrame(index=[1,2,3], columns=["Header","Body"], data=[["a","b"],["c","d"],["e","f"]])
    header = df.loc[int(id), "Header"]
    body = df.loc[int(id), "Body"]
    return '''<html><body>Header<br>{}<p>Body<br>{}<p></body></html>'''.format(header, body)
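To try it locally (a minimal sketch; the module name and port are assumptions, Flask's development server defaults to port 5000), save the snippet as e.g. app.py, run it, and open the URL with the id query parameter:
if __name__ == '__main__':
    app.run(debug=True)
#then browse to http://localhost:5000/database?id=2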
For a more detailed webpage, add a template.
Good luck!

r reticulate: rename duplicates from converted Python pandas dataframe

I am using the great new R package "reticulate" to combine Python and R so that I can use, from R, a data provider's API (Thomson Reuters Eikon) that is only available for Python. I want to do that because my R abilities are better than my (almost non-existent) Python abilities.
I use the function "get_news_headlines" from the Python module "eikon", which serves as the API to download data from Thomson Reuters Eikon. I automatically convert the resulting pandas dataframe to an R dataframe by setting the "convert" argument of the reticulate function "import" to TRUE.
The API sets the first column of the downloaded data, containing the news publication dates, as the index. When the dataframe is automatically converted to an R object, there are duplicates in the dates and I receive the following error message:
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique value when setting 'row.names': ‘2018-05-31 08:21:56’
Here is my code:
library(reticulate) #load reticulate package to combine Python with R
PYTHON_pandas <- import("pandas", convert = TRUE)
#import Python pandas via reticulate's function "import"
PYTHON_eikon <- import("eikon", convert = TRUE)
#import the Thomson Reuters API Python module for use of the API in R
#(I set convert to true to convert Python objects into their R equivalents)
#do not bother with the following line:
PYTHON_eikon$set_app_id('ADD EIKON APP ID HERE')
#set a Thomson Reuters application ID (step is necessary to download data from TR, any string works)
DF <- PYTHON_eikon$get_news_headlines(query = 'Topic:REAM AND Topic:US', count = 10L)
#save news data from the API in an R dataframe
#query is the Thomson Reuters code from their Eikon database
#count is the number of news articles to be downloaded, I arbitrarily set it to 10 articles here
So my problem is that I have to tell R to replace the duplicates in the pandas index before the conversion into an R dataframe happens, to avoid the stated error message. When I set the count argument to a small number and coincidentally do not have any duplicates, the code works perfectly fine as it is now.
This is probably easy for people with some knowledge of both R and Python (so not for me, as my Python knowledge is very limited). Unfortunately the code is not reproducible without Thomson Reuters data access.
Any help is highly appreciated!
EDIT:
Would it possibly be an option to set the argument convert = FALSE in the import function, to receive a pandas dataframe in R first? Then I would need a way to manipulate the Python pandas dataframe within R so that the duplicates are removed, or alternatively the pandas dataframe index is dropped, before I manually convert the pandas dataframe to an R dataframe. Is that possible with reticulate?
The documentation for the eikon Python package is not really good yet as it is a pretty new Python module.
@Moody_Mudskipper:
str(PYTHON_eikon) only returns Module(eikon) as I am only fetching the respective Python module with the import function.
names(PYTHON_eikon) returns:
"data_grid" "eikonError" "EikonError" "get_app_id" "get_data" "get_news_headlines" "get_news_story" "get_port_number" "get_symbology" "get_timeout" "get_timeseries" "json_requests" "news_request" "Profile" "send_json_request" "set_app_id" "set_port_number" "set_timeout" "symbology" "time_series" "tools" "TR_Field"
None of the available eikon functions seems to help me with my issue.
In case this rather special problem is ever interesting for someone else, I briefly want to share the solution I found in the meantime (not perfect, but working):
library(reticulate) #load reticulate package to combine Python with R
PYTHON_pandas <- import("pandas", convert = TRUE)
#import Python pandas via reticulate's function "import"
PYTHON_eikon <- import("eikon", convert = TRUE)
#import the Thomson Reuters API Python module for use of the API in R
#(I set convert to true to convert Python objects into their R equivalents)
#do not bother with the following line:
PYTHON_eikon$set_app_id('ADD EIKON APP ID HERE')
#set a Thomson Reuters application ID (step is necessary to download data from TR, any string works)
#Solution starts HERE:
library(rlist)      #for list.cbind
library(data.table) #for rbindlist
DF <- PYTHON_eikon$get_news_headlines(query = 'Topic:REAM AND Topic:US', count = 10L, raw_output = TRUE)
#use argument "raw_output" to receive a list instead of a dataframe
DF[c(2, 3)] <- NULL
#delete unrequired list-elements
DF <- list.cbind(DF)
#use "rlist" function "list.cbind" to column-bind list object "DF"
DF <- rbindlist(DF, fill = FALSE)
#use "data.table" function "rbindlist" to row-bind list object "DF"
Do you need to use the R package "reticulate" or could you look at other packages as well?
There is an open source wrapper for R available at GitHub: eikonapir. Although it is not officially supported, you might find it useful because it can execute your command without any problems:
get_news_headlines(query = 'Topic:REAM AND Topic:US', count = 10L)
Disclaimer: I am currently employed by Thomson Reuters.

Unable to run Stanford Core NLP annotator over whole data set

I have been trying to use Stanford Core NLP over a data set but it stops at certain indexes which I am unable to find.
The data set is available on Kaggle: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones/data
This is a function that outputs the sentiment of a paragraph by taking the mean sentiment value of individual sentences.
import json
#nlp is assumed to be a pycorenlp client connected to a running CoreNLP server,
#e.g. nlp = StanfordCoreNLP('http://localhost:9000')
def funcSENT(paragraph):
    all_scores = []
    output = nlp.annotate(paragraph, properties={
        "annotators": "tokenize,ssplit,parse,sentiment",
        "outputFormat": "json",
        # Only split the sentence at End Of Line. We assume that this method only takes in one single sentence.
        #"ssplit.eolonly": "true",
        # Setting enforceRequirements to skip some annotators and make the process faster
        "enforceRequirements": "false"
    })
    all_scores = []
    for i in range(0, len(output['sentences'])):
        all_scores.append(int(json.loads(output['sentences'][i]['sentimentValue'])) + 1)
    final_score = sum(all_scores) / len(all_scores)
    return round(final_score)
Now I run this code for every review in the 'Reviews' column using this code.
import pandas as pd
data_file = 'C:\\Users\\SONY\\Downloads\\Amazon_Unlocked_Mobile.csv'
data = pd.read_csv(data_file)
from pandas import *
i = 0
my_reviews = data['Reviews'].tolist()
senti = []
while(i < data.shape[0]):
    senti.append(funcSENT(my_reviews[i]))
    i = i + 1
But somehow I get this error and I am not able to find the problem. It's been many hours now, kindly help.
Error screenshot: https://i.stack.imgur.com/qFbCl.jpg
How to avoid this error?
As I understand, you're using pycorenlp with nlp=StanfordCoreNLP(...) and a running StanfordCoreNLP server. I won't check the data you are using since it appears to require a Kaggle account.
Running with the same setup but a different paragraph shows that printing "output" by itself reveals an error from the Java server; in my case:
java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: Input word not tagged
I THINK that because there is no part-of-speech annotator, the server cannot perform the parsing. Whenever you use parse or depparse, I think you need to have the "pos" annotator as well.
I am not sure what the sentiment annotator needs, but you may need other annotators such as "lemma" to get good sentiment results.
Print output by itself. If you get the same java error, try adding the "pos" annotator to see if you get the expected json. Otherwise, try to give a simpler example, using your own small dataset maybe, and comment or adjust your question.
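A hedged sketch of that check (it assumes the pycorenlp setup described above; the exact annotator list the sentiment model needs is not verified here):
paragraph = "This phone is great. The battery, however, is disappointing."  #any sample text
output = nlp.annotate(paragraph, properties={
    "annotators": "tokenize,ssplit,pos,lemma,parse,sentiment",
    "outputFormat": "json",
    "enforceRequirements": "false"
})
#if this prints a Java error string instead of a dict, the failure is on the server side
print(output)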

Series not callable when trying to parse string in DataFrame

I tried looking, but clearly I am missing a trick here. I tried to use a couple of ideas on splitting a string separated by ';' in a DataFrame in Python.
Can anybody tell me what I am doing wrong? I have only just picked up Python and would appreciate the help. What I want is to split the string in recipient-address and duplicate the rest of the row for each resulting address. I have a LOT of log files to get through, so it needs to be efficient. I am using Anaconda Python 2.7 on Windows 7 64-bit. Thanks.
The data in the input looks roughly like this:
#Fields: date-time,sender-address,recipient-address
2015-06-22T00:00:01.051Z, persona@gmail.com, other@gmail.com;mickey@gmail.com
2015-06-22T00:00:01.254Z, personb@gmail.com, mickey@gmail.com
What I am aiming at is:
#Fields: date-time,sender-address,recipient-address
2015-06-22T00:00:01.051Z, persona@gmail.com, other@gmail.com
2015-06-22T00:00:01.051Z, persona@gmail.com, mickey@gmail.com
2015-06-22T00:00:01.254Z, personb@gmail.com, mickey@gmail.com
I have tried this based on this
for LOGfile in LOGfiles[:1]:
    readin = pandas.read_csv(LOGfile, skiprows=[0,1,2,3], parse_dates=['#Fields: date-time'], date_parser = dateparse )
    #s = df['recipient-address'].str.split(';').apply(Series, 1).stack()
    df=pandas.concat([Series(row['#Fields: date-time'], row['sender-address'],row['recipient-address'].split(';'))
                      for _, row in readin.iterrows()]).reset_index()
I keep getting the error:
NameError                                 Traceback (most recent call last)
<ipython-input> in <module>()
      4 readin = pandas.read_csv(LOGfile, skiprows=[0,1,2,3], parse_dates= ['#Fields: date-time'], date_parser = dateparse )
      5 df=pandas.concat([Series(row['#Fields: date-time'], row['sender-address'],row['recipient-address'].split(';'))
----> 6                   for _, row in readin.iterrows()]).reset_index()
      7
NameError: name 'Series' is not defined
I updated this with more complete/correct code - it now generates one row in the output Dataframe df for each recipient-address in the input logfile.
This might not be the most efficient solution but at least it works :-)
Err, you would get a quicker answer, and one that is easier for the answerer, if with your question you a) give a complete and executable short example of the code you have tried which reproduces your error, b) include the sample data needed to reproduce the error, and c) include the example output/error messages from the code you show with the data you show. It's probably also a good idea to include version numbers and the platform you are running on. I'm working with 32-bit Python 2.7.8 on Windows 7 64-bit.
I created myself some sample data in a file log.txt:
date-time,sender-address,recipient-address
1-1-2015,me@my.com,me1@my.com;me2@my.com
2-2-2015,me3@my.com,me4@my.com;me5@my.com
I then created a complete working example python file (also making some minimal simplifications to your code snippet) and fixed it. My code which works with my data is:
import pandas
LOGfiles = ('log.txt','log.txt')
for LOGfile in LOGfiles[:1]:
    readin = pandas.read_csv(LOGfile, parse_dates=['date-time'])
    #s = df['recipient-address'].str.split(';').apply(Series, 1).stack()
    rows = []
    for _, row in readin.iterrows():
        for recip in row['recipient-address'].split(';'):
            rows.append(pandas.Series(data={'date-time':row['date-time'], 'sender-address':row['sender-address'],'recipient-address':recip}))
    df = pandas.concat(rows)
    print df
The output from this code is:
date-time            2015-01-01 00:00:00
recipient-address             me1@my.com
sender-address                 me@my.com
date-time            2015-01-01 00:00:00
recipient-address             me2@my.com
sender-address                 me@my.com
date-time            2015-02-02 00:00:00
recipient-address             me4@my.com
sender-address                me3@my.com
date-time            2015-02-02 00:00:00
recipient-address             me5@my.com
sender-address                me3@my.com
dtype: object
The main thing I did to find out what was wrong with your code was to break the problem down, because your code may be short but it includes several potential sources of problems besides the split. First I made sure the iteration over the rows works and that split(';') works as expected (it does); then I started constructing a Series and found I needed the pandas. prefix on Series, and the data={} passed as a dictionary.
HTH
barny
I updated the code below to add untested code for passing through the first six lines of the logfile directly to the output.
If all you're doing with the csv logfiles is this transformation, then a possibly faster approach - although not without some significant potential disadvantages - would be to avoid csv reader/pandas and process the csv logfiles at a text level, maybe something like this:
LOGfiles = ('log.txt','log.txt')
outfile = open('result.csv',"wt")
for LOGfile in LOGfiles[:1]:
    linenumber=0
    for line in open(LOGfile,"rt"):
        linenumber += 1
        if linenumber < 6:
            outfile.write(line)
        else:
            line = line.strip()
            fields = line.split(",")
            recipients = fields[2].split(';')
            for recip in recipients:
                outfile.write(','.join([fields[0],fields[1],recip])+'\n')
Some of the disadvantages of this approach are:
- The field for recipient-address is hardcoded, as are the fields for the output
- It happens to pass through the header line - you may want to make this more robust, e.g. by reading the header line before getting into the expansion code
- It assumes that the csv field separator is a hardcoded comma (,) and so won't like it if any of the fields in the csv file contain a comma
- It probably works OK with ascii csv files, but may barf on extended character sets (UTF, etc.) which are very commonly found these days
- It will likely be harder to maintain than the pandas approach
Some of these are quite serious and would take a lot of messing about to fix if you were going to code it yourself - particularly the character sets - so personally it's difficult to strongly recommend this approach - you need to weigh up the pros and cons for your situation.
HTH
barny
