Pandas dataframe to list of tuples - python

I parsed a .xlsx file into a pandas dataframe and want to convert it to a list of tuples. The dataframe has two columns.
The list of tuples should group the product_ids with their transaction_id. I saw a post on converting a pandas dataframe to a list of tuples, but that code pairs each transaction_id with a single product_id.
How can I get the list of tuples in the desired format shown at the bottom of the page?
import pandas as pd
import xlrd
#Import data
trans = pd.ExcelFile('/Users/Transactions.xlsx')
#parse xlsx file into dataframe
transdata = trans.parse('Orders')
#view dataframe
#print transdata
transaction_id product_id
0 20001 48165
1 20001 48162
2 20001 48166
3 20004 48815
4 20005 48165
#Create tuple
trans_set = [tuple(x) for x in transdata.values]
print trans_set
[(20001, 48165), (20001, 48162), (20001, 48166), (20004, 48815), (20005, 48165)]
Desired Result:
[(20001, [48165, 48162, 48166]), (20004, 48815), (20005, 48165)]

trans_set = [(key, list(grp)) for key, grp in
             transdata.groupby(['transaction_id'])['product_id']]
In [268]: trans_set
Out[268]: [(20001, [48165, 48162, 48166]), (20004, [48815]), (20005, [48165])]
This is a little different from your desired result -- note the (20004, [48815]), for example -- but I think it is more consistent. The second item in each tuple is a list of all the product_ids associated with the transaction_id. It might consist of only one element, but it is always a list.
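If you do want a bare scalar whenever a transaction has a single product, as in your desired result, a small post-processing pass can unwrap the one-element lists (a sketch, not part of the original answer; note the CSV snippet below assumes the all-list form, so apply this only if you skip that step):

# unwrap single-element lists so lone products appear as scalars
trans_set = [(key, grp[0] if len(grp) == 1 else grp) for key, grp in trans_set]
# [(20001, [48165, 48162, 48166]), (20004, 48815), (20005, 48165)]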
To write trans_set to a CSV, you could use the csv module:
import csv
with open('/tmp/data.csv', 'wb') as f:
    writer = csv.writer(f)
    for key, grp in trans_set:
        writer.writerow([key] + grp)
yields a file, /tmp/data.csv, with content:
20001,48165,48162,48166
20004,48815
20005,48165

Related

Pandas dataframe writing to excel as list. But I don't want data as list in excel

I have code which iterates through an excel file and extracts values from its columns, loading them as lists into a dataframe. When I write the dataframe back to excel, the data appears wrapped in [], with quotes for strings ['']. How can I remove [''] when I write to excel?
Also, I want to write only the first value in the Product ID column to excel. How can I do that?
result = pd.DataFrame.from_dict(result) # result has list of data
df_t = result.T
writer = pd.ExcelWriter(path)
df_t.to_excel(writer, 'data')
writer.save()
My output to excel:
I am expecting output as below, and the Product_ID column should contain only the first value in each list.
I tried the below and am getting an error:
path = "my_file.xlsx"  # path to the excel file (placeholder)
df = pd.read_excel(path, engine="openpyxl")

def data_clean(x):
    for index, data in enumerate(x.values):
        item = eval(data)
        if len(item):
            x.values[index] = item[0]
        else:
            x.values[index] = ""
    return x

new_df = df.apply(data_clean, axis=1)
new_df.to_excel(path)
I am getting the error below:
item = eval(data)
TypeError: eval() arg 1 must be a string, bytes or code object
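A quick diagnostic first (a sketch, assuming the df from the code above): this TypeError means eval() received something that is not a string, which suggests the cells already hold Python lists rather than their string representations.

# check what the first cell actually contains before trying to parse it
print(type(df.iloc[0, 0]))  # <class 'list'> means you can index it directly, no eval() needed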
df_t['id'] = df_t['id'].str[0]  # a shortcut if you only want the 0th element
df_t['other_columns'] = df_t['other_columns'].apply(lambda x: " ".join(x))  # "unlists" the list values you have fed into a pandas column
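Put in context, a minimal sketch of those two lines (assuming df_t carries list-valued columns as in the question; the column names here are illustrative):

# keep only the first product id from each list-valued cell
df_t['Product_ID'] = df_t['Product_ID'].str[0]
# flatten the other list-valued columns into plain strings
for col in ['name', 'xxx', 'yyy']:  # hypothetical column names
    df_t[col] = df_t[col].apply(lambda x: " ".join(map(str, x)))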
This should give the effect you want, but you have to make sure that the data in each cell is in ['', ...] form; if it differs, you can adjust how it is handled in the data_clean function:
import pandas as pd

df = pd.read_excel("1.xlsx", engine="openpyxl")

def data_clean(x):
    for index, data in enumerate(x.values):
        item = eval(data)
        if len(item):
            x.values[index] = item[0]
        else:
            x.values[index] = ""
    return x

new_df = df.apply(data_clean, axis=1)
new_df.to_excel("new.xlsx")
The following is an example of df and the modified new_df (some randomly generated data):
# df
name Product_ID xxx yyy
0 ['Allen'] ['AF124', 'AC12414'] [124124] [222]
1 ['Aaszflen'] ['DF124', 'AC12415'] [234125] [22124124,124125]
2 ['Allen'] ['CF1sdv24', 'AC12416'] [123544126] [33542124124,124126]
3 ['Azdxven'] ['BF124', 'AC12417'] [35127] [333]
4 ['Allen'] ['MF124', 'AC12418'] [3528] [12352324124,124128]
5 ['Allen'] ['AF124', 'AC12419'] [122359] [12352324124,124129]
# new_df
name Product_ID xxx yyy
0 Allen AF124 124124 222
1 Aaszflen DF124 234125 22124124
2 Allen CF1sdv24 123544126 33542124124
3 Azdxven BF124 35127 333
4 Allen MF124 3528 12352324124
5 Allen AF124 122359 12352324124
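One caveat on the eval() call above: if the cells really are strings, ast.literal_eval is the safer parser, and an isinstance check covers both the string and the already-a-list case (a sketch, not part of the original answer):

import ast

def safe_first(data):
    # parse stringified lists safely; pass real lists through untouched
    item = ast.literal_eval(data) if isinstance(data, str) else data
    return item[0] if len(item) else ""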

convert dataframe row to dict

I have a dataframe like the sample data below. I'm trying to convert one row from the dataframe into a dict like the desired output below. But when I use to_dict I get the index along with the column value. Does anyone know how to convert the row to a dict like the desired output? Any tips greatly appreciated.
Sample data:
print(catStr_df[['Bottle Volume (ml)', 'Pack']][:5])
Bottle Volume (ml) Pack
595 750 12
1889 750 12
3616 1000 12
4422 750 12
5022 750 12
Code:
v = catStr_df[catStr_df['Item Number']==34881][['Bottle Volume (ml)', 'Pack']]\
    .drop_duplicates(keep='first').to_dict()
v
Output:
{'Bottle Volume (ml)': {9534: 1000}, 'Pack': {9534: 12}}
Desired output:
{'Bottle Volume (ml)': 1000, 'Pack': 12}
Try adding .to_dict('records')[0] to the row you want
catStr_df[catStr_df['Item Number']==34881].to_dict('records')[0]
Use df.to_dict(orient='index') to have index value as keys for easy retrieval of data
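For example, with orient='index' the index labels become the outer keys, so a row can be pulled out directly (a sketch using the sample data above):

d = catStr_df[['Bottle Volume (ml)', 'Pack']].to_dict(orient='index')
print(d[595])  # {'Bottle Volume (ml)': 750, 'Pack': 12}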
Taking a different tack, this works but you need a list of columns. This assumes you want the index number as a dict item:
import pandas as pd

def row_converter(row, listy):
    # convert a pandas row (as a tuple) to a dictionary
    # requires a list of columns and a row as a tuple
    count = 1
    pictionary = {}
    pictionary['Index'] = row[0]
    for item in listy:
        pictionary[item] = row[count]
        count += 1
    print(pictionary)
    return pictionary

df = pd.read_csv("yourFile", dtype=object, delimiter=",", na_filter=False)
listy = df.columns
for row in df.itertuples():
    rowDict = row_converter(row, listy)
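As an aside, itertuples() yields namedtuples, so when the column names are valid Python identifiers the helper can be skipped entirely via _asdict() (a sketch):

for row in df.itertuples():
    rowDict = dict(row._asdict())  # keys are 'Index' plus the column names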

Pandas: Trouble setting value for each column

I have an empty Pandas dataframe and I'm trying to add a row to it. Here's what I mean:
text_img_count = len(BeautifulSoup(html, "lxml").find_all('img'))
print 'img count: ', text_img_count
keys = ['text_img_count', 'text_vid_count', 'text_link_count', 'text_par_count', 'text_h1_count',
'text_h2_count', 'text_h3_count', 'text_h4_count', 'text_h5_count', 'text_h6_count',
'text_bold_count', 'text_italic_count', 'text_table_count', 'text_word_length', 'text_char_length',
'text_capitals_count', 'text_sentences_count', 'text_middles_count', 'text_rows_count',
'text_nb_digits', 'title_char_length', 'title_word_length', 'title_nb_digits']
values = [text_img_count, text_vid_count, text_link_count, text_par_count, text_h1_count,
text_h2_count, text_h3_count, text_h4_count, text_h5_count, text_h6_count,
text_bold_count, text_italic_count, text_table_count, text_word_length,
text_char_length, text_capitals_count, text_sentences_count, text_middles_count,
text_rows_count, text_nb_digits, title_char_length, title_word_length, title_nb_digits]
numeric_df = pd.DataFrame()
for key, value in zip(keys, values):
    numeric_df[key] = value
print numeric_df.head()
However, the output is this:
img count: 2
Empty DataFrame
Columns: [text_img_count, text_vid_count, text_link_count, text_par_count, text_h1_count, text_h2_count, text_h3_count, text_h4_count, text_h5_count, text_h6_count, text_bold_count, text_italic_count, text_table_count, text_word_length, text_char_length, text_capitals_count, text_sentences_count, text_middles_count, text_rows_count, text_nb_digits, title_char_length, title_word_length, title_nb_digits]
Index: []
[0 rows x 23 columns]
This makes it seem like numeric_df is empty after I just assigned values for each of its columns.
What's going on?
Thanks for the help!
What I usually do to add a column to the empty data frame is to append the information into a list and then give it a data frame structure. For example:
df=pd.DataFrame()
L=['a','b']
df['SomeName']=pd.DataFrame(L)
And you have to use pd.Series() if the list is made of numbers.
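For completeness, the reason the original loop prints an empty frame: assigning a scalar to a column of an empty dataframe leaves the empty index in place, so every column ends up with zero rows. Building the single row in one call avoids this (a sketch):

# build a one-row dataframe from the parallel keys/values lists
numeric_df = pd.DataFrame([values], columns=keys)
print numeric_df.head()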

python numpy csv header in column not row

I have a script which produces a 15x1096 array of data using
np.savetxt("model_concentrations.csv", model_con, header=','.join(sources), delimiter=",")
Each of the 15 rows corresponds to a source of emissions, while each column is 1 day over 3 years. If at all possible I would like to have a 'header' in column 1 which states the emission source. When I use the option header=','.join(sources), these labels get placed in the first row (as expected), i.e.:
2per 3rd_pvd 3rd_unpvd 4rai_rd 4rai_yd 5rmo 6hea
2.44E+00 2.12E+00 1.76E+00 1.33E+00 6.15E-01 3.26E-01 2.29E+00 ...
1.13E-01 4.21E-02 3.79E-02 2.05E-02 1.51E-02 2.29E-02 2.36E-01 ...
My question is, is there a way to inverse the header so the csv appears like this:
2per 7.77E+00 8.48E-01 ...
3rd_pvd 1.86E-01 3.62E-02 ...
3rd_unpvd 1.04E+00 2.65E-01 ...
4rai_rd 8.68E-02 2.88E-02 ...
4rai_yd 1.94E-01 8.58E-02 ...
5rmo 7.71E-01 1.17E-01 ...
6hea 1.07E+01 2.71E+00 ...
...
Labels for rows and columns are one of the main reasons for the existence of pandas.
import pandas as pd
# Assemble your source labels in a list
sources = ['2per', '3rd_pvd', '3rd_unpvd', '4rai_rd',
'4rai_yd', '5rmo', '6hea', ...]
# Create a pandas DataFrame wrapping your numpy array
df = pd.DataFrame(model_con, index=sources)
# Saving it as a .csv file writes the index too
df.to_csv('model_concentrations.csv', header=False)
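If you'd rather avoid pandas, stacking the labels on as a first column and saving with a string format also works (a sketch; note that every value is then written as text):

import numpy as np

# prepend the source labels as column 0, then write everything as strings
labeled = np.column_stack((sources, model_con))
np.savetxt("model_concentrations.csv", labeled, fmt="%s", delimiter=",")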

Extract nested JSON embedded as string in Pandas dataframe

I have a CSV where one of the fields is a nested JSON object, stored as a string. I would like to load the CSV into a dataframe and parse the JSON into a set of fields appended to the original dataframe; in other words, extract the contents of the JSON and make them part of the dataframe.
My CSV:
id|dist|json_request
1|67|{"loc":{"lat":45.7, "lon":38.9},"arrival": "Monday", "characteristics":{"body":{"color":"red", "make":"sedan"}, "manuf_year":2014}}
2|34|{"loc":{"lat":46.89, "lon":36.7},"arrival": "Tuesday", "characteristics":{"body":{"color":"blue", "make":"sedan"}, "manuf_year":2014}}
3|98|{"loc":{"lat":45.70, "lon":31.0}, "characteristics":{"body":{"color":"yellow"}, "manuf_year":2010}}
Note that not all keys are the same for all the rows.
I'd like it to produce a data frame equivalent to this:
data = {'id' : [1, 2, 3],
'dist' : [67, 34, 98],
'loc_lat': [45.7, 46.89, 45.70],
'loc_lon': [38.9, 36.7, 31.0],
'arrival': ["Monday", "Tuesday", "NA"],
'characteristics_body_color':["red", "blue", "yellow"],
'characteristics_body_make':["sedan", "sedan", "NA"],
'characteristics_manuf_year':[2014, 2014, 2010]}
df = pd.DataFrame(data)
(I'm really sorry, I can't get the table itself to look sensible in SO! Please don't be mad at me, I'm a rookie :( )
What I've tried
After a lot of futzing around, I came up with the following solution:
import json
import pandas as pd
from pandas.io.json import json_normalize

#Import data
df_raw = pd.read_csv("sample.csv", delimiter="|")

#Parsing function
def parse_request(s):
    sj = json.loads(s)
    norm = json_normalize(sj)
    return norm

#Create an empty dataframe to store results
parsed = pd.DataFrame(columns=['id'])

#Loop through and parse JSON in each row
for i in df_raw.json_request:
    parsed = parsed.append(parse_request(i))

#Merge results back onto original dataframe
df_parsed = df_raw.join(parsed)
This is obviously inelegant and really inefficient (would take multiple hours on the 300K rows that I have to parse). Is there a better way?
Where I've looked
I've gone through the following related questions:
Reading a CSV into pandas where one column is a json string
(which seems to only work for simple, non-nested JSON)
JSON to pandas DataFrame
(I borrowed parts of my solutions from this, but I can't figure out how to apply this solution across the dataframe without looping through rows)
I'm using Python 3.3 and Pandas 0.17.
Here's an approach that speeds things up by a factor of 10 to 100, and should allow you to read your big file in under a minute, as opposed to over an hour. The idea is to only construct a dataframe once all of the data has been read, thereby reducing the number of times memory needs to be allocated, and to only call json_normalize once on the entire chunk of data, rather than on each row:
import csv
import json
import pandas as pd
from pandas.io.json import json_normalize
with open('sample.csv') as fh:
    rows = csv.reader(fh, delimiter='|')
    header = next(rows)

    # "transpose" the data. `data` is now a tuple of strings
    # containing JSON, one for each row
    idents, dists, data = zip(*rows)

data = [json.loads(row) for row in data]
df = json_normalize(data)
df['ids'] = idents
df['dists'] = dists
So that:
>>> print(df)
arrival characteristics.body.color characteristics.body.make \
0 Monday red sedan
1 Tuesday blue sedan
2 NaN yellow NaN
characteristics.manuf_year loc.lat loc.lon ids
0 2014 45.70 38.9 1
1 2014 46.89 36.7 2
2 2010 45.70 31.0 3
Furthermore, I looked into what pandas's json_normalize is doing, and it's performing some deep copies that shouldn't be necessary if you're just creating a dataframe from a CSV. We can implement our own flatten function which takes a dictionary and "flattens" the keys, similar to what json_normalize does. Then we can make a generator which spits out one row of the dataframe at a time as a record. This approach is even faster:
def flatten(dct, separator='_'):
    """A fast way to flatten a dictionary."""
    res = {}
    queue = [('', dct)]
    while queue:
        prefix, d = queue.pop()
        for k, v in d.items():
            key = prefix + k
            if not isinstance(v, dict):
                res[key] = v
            else:
                queue.append((key + separator, v))
    return res

def records_from_json(fh):
    """Yields the records from a file object."""
    rows = csv.reader(fh, delimiter='|')
    header = next(rows)
    for ident, dist, data in rows:
        rec = flatten(json.loads(data))
        rec['id'] = ident
        rec['dist'] = dist
        yield rec

def from_records(path):
    with open(path) as fh:
        return pd.DataFrame.from_records(records_from_json(fh))
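Usage is then a one-liner (a sketch):

df = from_records('sample.csv')
print(df.head())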
And here are the results of a timing experiment where I artificially increased the size of your sample data by repeating rows. The number of lines is denoted by n_rows:
method 1 (s) method 2 (s) original time (s)
n_rows
96 0.008217 0.002971 0.362257
192 0.014484 0.004720 0.678590
384 0.027308 0.008720 1.373918
768 0.055644 0.016175 2.791400
1536 0.105730 0.030914 5.727828
3072 0.209049 0.060105 11.877403
Extrapolating linearly, the first method should read 300k lines in about 20 seconds, while the second method should take around 6 seconds.
