I want to import OData XML data feeds from the Dutch Bureau of Statistics (CBS) into our database. Using lxml and pandas, I thought this would be straightforward. By using OrderedDict I want to preserve the order of the columns for readability, but somehow I can't get it right.
from collections import OrderedDict
from lxml import etree
import requests
import pandas as pd
# CBS URLs
base_url = 'http://opendata.cbs.nl/ODataFeed/odata'
datasets = ['/37296ned', '/82245NED']
feed = requests.get(base_url + datasets[1] + '/TypedDataSet')
root = etree.fromstring(feed.content)
# all record entries start at tag m:properties, parse into data dict
data = []
for record in root.iter('{{{}}}properties'.format(root.nsmap['m'])):
    row = OrderedDict()
    for element in record:
        row[element.tag.split('}')[1]] = element.text
    data.append(row)
df = pd.DataFrame.from_dict(data)
df.columns
Inspecting data, each OrderedDict is in the right order. But looking at df.head(), the columns have been sorted alphabetically, with capitalized names first.
Help, anyone?
Something in your example seems to be inconsistent, as data is a list and not a dict, but assuming you really have an OrderedDict:
Try to explicitly specify your column order when you create your DataFrame:
# ... all your data collection
df = pd.DataFrame(data, columns=data.keys())
This should give you your DataFrame with the columns ordered exactly the way they are in the OrderedDict (via the list generated from data.keys()).
The above answer doesn't work for me and keeps giving me "ValueError: cannot use columns parameter with orient='columns'".
Later I found a solution that worked by doing the following:
df = pd.DataFrame.from_dict(dict_data)[list(dict_data[0].keys())]
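Applied to the CBS example from the original question, the same idea can be written in two equivalent ways; a small sketch, assuming data is the list of OrderedDicts built by the parsing loop above:
# data is the list of OrderedDicts from the parsing loop in the question;
# selecting the first record's keys restores the feed's column order
df = pd.DataFrame(data)[list(data[0].keys())]
# equivalent: pass the order explicitly instead of reindexing afterwards
df = pd.DataFrame(data, columns=list(data[0].keys()))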
I am using the pyreadstat library to read SAS dataset files (*.sas7bdat, *.xpt).
import pyreadstat as pd
import pandas as pda
import sys
import json
FILE_LOC = sys.argv[1]
PAGE_SIZE = 100
PAGE_NO = int(sys.argv[2])-1
START_FROM_ROW = (PAGE_NO * PAGE_SIZE)
pda.set_option('display.max_columns',None)
pda.set_option('display.width',None)
pda.set_option('display.max_rows',None)
df = pd.read_sas7bdat(FILE_LOC, row_offset=START_FROM_ROW, row_limit=PAGE_SIZE, output_format='dict')
finalList = []
for key in df[0]:
    l = list(map(lambda x: str(x) if str(x) == "nan" else x, df[0][key].tolist()))
    nparray = {key: l}
    finalList.append(nparray)
return json.dumps(finalList)
How to perform sorting using pyreadstat library?
Unfortunately, pyreadstat cannot return sorted data. You need to read the sas7bdat file into memory, and then you can sort it.
In order to sort, take into consideration that pyreadstat returns a tuple of a pandas DataFrame and a metadata object. Once you have the DataFrame you can sort it by one or multiple columns using the sort_values method. Therefore it is better to get a DataFrame rather than a dictionary in this case.
df, meta = pd.read_sas7bdat(FILE_LOC, row_offset=START_FROM_ROW, row_limit=PAGE_SIZE)
# sort
df_sorted = df.sort_values(["columnA", "columnB"])
# replace NaNs
df_sorted = df_sorted.fillna("nan")
# you can convert directly to JSON
# (check the options in the documentation; the default layout may differ from what you want)
out = df_sorted.to_json()
# otherwise transform to a dict and build your JSON as before
out = df_sorted.to_dict(orient='list')
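If the paginated JSON from the original question is still needed, the whole file has to be read first so the sort sees every row, and the page is sliced out afterwards. A rough sketch; the file path, page number and column names below are placeholders:
import json
import pyreadstat  # imported under its own name rather than the 'pd' alias used in the question

FILE_LOC = "data.sas7bdat"   # hypothetical path
PAGE_SIZE = 100
PAGE_NO = 2                  # hypothetical page number

# read everything into memory, then sort
df, meta = pyreadstat.read_sas7bdat(FILE_LOC)
df_sorted = df.sort_values(["columnA", "columnB"])  # placeholder column names

# slice out the requested page and serialise it
start = (PAGE_NO - 1) * PAGE_SIZE
page = df_sorted.iloc[start:start + PAGE_SIZE].fillna("nan")
out = json.dumps(page.to_dict(orient="list"))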
Basically, I have one CSV file called 'Leads.csv' that contains all the sales leads we already have. I want to turn its 'Leads' column into a list, then check a 'Report' CSV to see whether any of those leads are already in there and filter them out.
Here's what I have tried:
import pandas as pd
df_leads = pd.read_csv('Leads.csv')
leads_list = df_leads['Leads'].values.tolist()
df = pd.read_csv('Report.csv')
df = df.loc[(~df['Leads'].isin(leads_list))]
df.to_csv('Filtered Report.csv', index=False)
Any help is much appreciated!
You can try:
import pandas as pd
df_leads = pd.read_csv('Leads.csv')
df = pd.read_csv('Report.csv')
set_filtered = set(df['Leads']) - set(df_leads['Leads'])
df_filtered = df[df['Leads'].isin(set_filtered)]
Note: sets are significantly faster than lists for this operation.
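A roughly equivalent variant stays closer to the original attempt and keeps the ~isin filter, but feeds it a set instead of a list; a sketch reusing the file and column names from the question:
import pandas as pd

# build a set of known leads for fast membership tests
leads = set(pd.read_csv('Leads.csv')['Leads'])

# keep only the report rows whose lead is not already known
report = pd.read_csv('Report.csv')
filtered = report[~report['Leads'].isin(leads)]
filtered.to_csv('Filtered Report.csv', index=False)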
I have a CSV file with 100K+ lines of data in this format:
"{'foo':'bar' , 'foo1':'bar1', 'foo3':'bar3'}"
"{'foo':'bar' , 'foo1':'bar1', 'foo4':'bar4'}"
The quotes are there before the curly braces because my data came in a CSV file.
I want to extract the key value pairs in all the lines to create a dataframe like so:
Column Headers: foo, foo1, foo3, foo...
Rows: bar, bar1, bar3, bar...
I've tried implementing something similar to what's explained here (Python: error parsing strings from text file with Ast module).
I've gotten the ast.literal_eval function to work on my file to convert the contents into a dict but now how do I get the DataFrame function to work? I am very much a beginner so any help would be appreciated.
import pandas as pd
import ast
with open('file_name.csv') as f:
    for string in f:
        parsed = ast.literal_eval(string.rstrip())
        print(parsed)

pd.DataFrame(???)
You can turn a dictionary into a pandas dataframe using pd.DataFrame.from_dict, but it will expect each value in the dictionary to be in a list.
for key, value in parsed.items():
    parsed[key] = [value]

df = pd.DataFrame.from_dict(parsed)
You can do this iteratively by appending to your dataframe.
df = pd.DataFrame()
for string in f:
    parsed = ast.literal_eval(string.rstrip())
    for key, value in parsed.items():
        parsed[key] = [value]
    # append returns a new DataFrame, so the result has to be reassigned
    df = df.append(pd.DataFrame.from_dict(parsed))
parsed is a dictionary; you can make a DataFrame from each one and then join all the frames together:
df = []
with open('file_name.csv') as f:
    for string in f:
        parsed = ast.literal_eval(string.rstrip())
        if type(parsed) != dict:
            continue
        subDF = pd.DataFrame(parsed, index=[0])
        df.append(subDF)
df = pd.concat(df, ignore_index=True, sort=False)
Calling pd.concat on a list of DataFrames is faster than calling DataFrame.append repeatedly. sort=False means that pd.concat will not sort the column names when it encounters a new one, like foo4 on the second row.
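As a quick sanity check, here is a self-contained sketch of the concat approach using the two sample lines from the question as literal strings, together with the frame it produces:
import ast
import pandas as pd

lines = ["{'foo':'bar' , 'foo1':'bar1', 'foo3':'bar3'}",
         "{'foo':'bar' , 'foo1':'bar1', 'foo4':'bar4'}"]
frames = [pd.DataFrame(ast.literal_eval(s), index=[0]) for s in lines]
print(pd.concat(frames, ignore_index=True, sort=False))
#    foo  foo1 foo3 foo4
# 0  bar  bar1 bar3  NaN
# 1  bar  bar1  NaN bar4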
I simply want to create a CSV file from the constructed DataFrame so I do not have to use the internet to access the information. The columns come from the lists in the code: 'CIK', 'Ticker', 'Company', 'Sector', 'Industry'.
My current code is as follows:
def stockStat():
    doc = pq('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
    for heading in doc(".mw-headline:contains('S&P 500 Component Stocks')").parent("h2"):
        rows = pq(heading).next("table tr")
        cik = []
        ticker = []
        coName = []
        sector = []
        industry = []
        for row in rows:
            tds = pq(row).find("td")
            cik.append(tds.eq(7).text())
            ticker.append(tds.eq(0).text())
            coName.append(tds.eq(1).text())
            sector.append(tds.eq(3).text())
            industry.append(tds.eq(4).text())
        d = {'CIK': cik, 'Ticker': ticker, 'Company': coName, 'Sector': sector, 'Industry': industry}
        stockData = pd.DataFrame(d)
        stockData = stockData.set_index('Ticker')
stockStat()
As EdChum already mentioned in the comments, creating a CSV out of a pandas DataFrame is done with the DataFrame.to_csv() method.
The DataFrame.to_csv() method takes lots of arguments; they are all covered in the DataFrame.to_csv() documentation. Here is a small example for you:
import pandas as pd
df = pd.DataFrame({'mycolumn': [1,2,3,4]})
df.to_csv('~/myfile.csv')
After this, the myfile.csv should be available in your home directory.
If you are using Windows, saving the file to 'C:\myfile.csv' should work better as a proof of concept.
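Applied to the stockStat() function from the question, the only real change is a to_csv() call on stockData once it has been built, plus a read_csv() to load it back later without touching the internet. A small sketch with a hypothetical file name (sp500_components.csv):
import pandas as pd

def save_stock_stat(stockData: pd.DataFrame, path: str = 'sp500_components.csv') -> None:
    # write the scraped table once; 'Ticker' is already the index, so it is saved too
    stockData.to_csv(path)

def load_stock_stat(path: str = 'sp500_components.csv') -> pd.DataFrame:
    # read the table back offline, restoring 'Ticker' as the index
    return pd.read_csv(path, index_col='Ticker')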
I'm trying to create a pandas DataFrame by iterating through data in a soup (from BeautifulSoup4). This SO post suggested using the .loc method for setting with enlargement to create a DataFrame.
However, this method takes a long time to run (around 8 minutes for a df of 30,000 rows and 5 columns). Is there any quicker way of doing this? Here's my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "http://api.turfgame.com/v3/zones"
r = requests.get(url)
soup = BeautifulSoup(r.content)
col_names = ["name", "lat", "lng", "points_take", "points_hold"]
dfi = pd.DataFrame(columns=col_names)
def get_all_zones():
    for attr in soup.find_all("zone"):
        col_values = [attr.get("name"), attr.get("lat"), attr.get("lng"), attr.get("points_take"), attr.get("points_hold")]
        pos = len(dfi.index)
        dfi.loc[pos] = col_values
    return dfi
get_all_zones()
Avoid
df.loc[pos] = row
whenever possible. Pandas NDFrames store the underlying data in blocks (of common dtype) which (for DataFrames) are associated with columns. DataFrames are column-based data structures, not row-based data structures.
To access a row, the DataFrame must access each block, pick out the appropriate row and copy the data into a new Series.
Adding a row to an existing DataFrame is also slow, since a new row must be appended to each block, and new data copied into the new row. Even worse, the data block has to be contiguous in memory. So adding a new row may force Pandas (or NumPy) to allocate a whole new array for the block, and all the data for that block has to be copied into a larger array just to accommodate that one row. All that copying makes things very slow. So avoid it if possible.
The solution in this case is to append the data to a Python list and create the DataFrame in one fell swoop at the end:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "http://api.turfgame.com/v3/zones"
r = requests.get(url)
soup = BeautifulSoup(r.content)
col_names = ["name", "lat", "lng", "points_take", "points_hold"]
data = []
def get_all_zones():
    for attr in soup.find_all("zone"):
        col_values = [attr.get("name"), attr.get("lat"), attr.get("lng"),
                      attr.get("points_take"), attr.get("points_hold")]
        data.append(col_values)
    dfi = pd.DataFrame(data, columns=col_names)
    return dfi
dfi = get_all_zones()
print(dfi)
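One small follow-up: BeautifulSoup attribute values come back as strings, so if the coordinate and point columns are needed as numbers they can be converted after the frame is built. A sketch, assuming dfi and the column names from the code above:
# convert the string columns returned by attr.get() to numeric dtypes
for col in ["lat", "lng", "points_take", "points_hold"]:
    dfi[col] = pd.to_numeric(dfi[col], errors="coerce")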