Joining a list and a list of lists in Python

I have what should be a simple problem, but three hours into trying different things I can't solve it.
I have pymysql returning results from a query. I can't share the exact example, but this straw man should do:
cur.execute("select name, address, phonenum from contacts")
This returns results perfectly, which I grab with
results = cur.fetchall()
and then convert to a list object exactly as I want it:
data = list(results)
Unfortunately this doesn't include the header, but you can get it from cur.description (which contains metadata including, but not limited to, the header). I push this into a list:
header = []
for n in cur.description:
    header.append(str(n[0]))
so my header looks like:
['name','address','phonenum']
and my results look like:
[['Tom','dublin','12345'],['Bob','Kerry','56789']]
I want to create a DataFrame in pandas and then pivot it, but that needs column headers to work properly. I had previously been importing a completed CSV into a pandas DataFrame, which included the header, so this all worked smoothly. Now I need to get this data directly from the DB, so I thought: that's easy, I just join the two lists and hey presto, I have what I am looking for. But when I try to append, I actually wind up with this:
['name','address','phonenum',['Tom','dublin','12345'],['Bob','Kerry','56789']]
when I need this:
[['name','address','phonenum'],['Tom','dublin','12345'],['Bob','Kerry','56789']]
Anyone any ideas?
Much appreciated!

Addition of lists concatenates contents:
In [17]: [1] + [2,3]
Out[17]: [1, 2, 3]
This is true even if the contents are themselves lists:
In [18]: [[1]] + [[2],[3]]
Out[18]: [[1], [2], [3]]
So:
In [13]: header = ['name','address','phonenum']
In [14]: data = [['Tom','dublin','12345'],['Bob','Kerry','56789']]
In [15]: [header] + data
Out[15]:
[['name', 'address', 'phonenum'],
['Tom', 'dublin', '12345'],
['Bob', 'Kerry', '56789']]
In [16]: pd.DataFrame(data, columns=header)
Out[16]:
name address phonenum
0 Tom dublin 12345
1 Bob Kerry 56789
Note that loading a DataFrame with data from a database can also be done with pandas.read_sql.
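A minimal sketch of that read_sql route, using an in-memory SQLite database as a stand-in for the real pymysql connection (the table and values here are made up to mirror the example above):

```python
import sqlite3
import pandas as pd

# stand-in connection; with pymysql you would pass its connection object instead
conn = sqlite3.connect(":memory:")
conn.execute("create table contacts (name text, address text, phonenum text)")
conn.executemany("insert into contacts values (?, ?, ?)",
                 [("Tom", "dublin", "12345"), ("Bob", "Kerry", "56789")])

# read_sql builds the DataFrame with column names taken straight from the query
df = pd.read_sql("select name, address, phonenum from contacts", conn)
print(df.columns.tolist())   # ['name', 'address', 'phonenum']
```

No header juggling needed: the DataFrame arrives with its columns already set.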

Is that what you are looking for?
first = ['name','address','phonenum']
second = [['Tom','dublin','12345'],['Bob','Kerry','56789']]
second = [first] + second
print(second)
[['name', 'address', 'phonenum'], ['Tom', 'dublin', '12345'], ['Bob', 'Kerry', '56789']]

Other possibilities:
You could insert it into data at position 0 as a list:
header = ['name','address','phonenum']
data = [['Tom','dublin','12345'],['Bob','Kerry','56789']]
data.insert(0,header)
print(data)
[['name', 'address', 'phonenum'], ['Tom', 'dublin', '12345'], ['Bob', 'Kerry', '56789']]
But if you are going to manipulate the header variable afterwards, you can insert a shallow copy instead:
header = ['name','address','phonenum']
data = [['Tom','dublin','12345'],['Bob','Kerry','56789']]
data.insert(0,header[:])
print(data)
[['name', 'address', 'phonenum'], ['Tom', 'dublin', '12345'], ['Bob', 'Kerry', '56789']]
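A small demonstration of what the copy guards against: without it, data[0] and header are the same list object, so later changes to header show up inside data.

```python
header = ['name', 'address', 'phonenum']
data = [['Tom', 'dublin', '12345'], ['Bob', 'Kerry', '56789']]

data.insert(0, header)   # no copy: data[0] IS header
header.append('email')   # later "manipulation" of header...
print(data[0])           # ...leaks into data: ['name', 'address', 'phonenum', 'email']
```

With data.insert(0, header[:]) the appended 'email' would stay out of data.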

Related

Convert pandas dataframe group of values to multiple lists

I have a pandas dataframe where I listed items and categorised them:
col_name    | col_group
------------|----------
id          | Metadata
listing_url | Metadata
scrape_id   | Metadata
name        | Text
summary     | Text
space       | Text
To reproduce:
import pandas
df = pandas.DataFrame([
    ['id', 'metadata'],
    ['listing_url', 'metadata'],
    ['scrape_id', 'metadata'],
    ['name', 'Text'],
    ['summary', 'Text'],
    ['space', 'Text']],
    columns=['col_name', 'col_group'])
Can you suggest how I can convert this dataframe to multiple lists based on "col_group":
Metadata = ['id','listing_url','scrape_id']
Text = ['name','summary','space']
This is to allow me to pass these lists of columns to pandas and drop the columns.
I googled a lot and got stuck: all answers are about converting lists to df, not vice versa. Should I aim to convert into dictionary, or list of lists?
I have over 100 rows, belonging to 10 categories, so would like to avoid manual hard-coding.
I've tried this code:
import pandas
df = pandas.DataFrame([
    [1, 'url_a', 'scrap_a', 'name_a', 'summary_a', 'space_a'],
    [2, 'url_b', 'scrap_b', 'name_b', 'summary_b', 'space_b'],
    [3, 'url_c', 'scrap_c', 'name_c', 'summary_c', 'space_ac']],
    columns=['id', 'listing_url', 'scrape_id', 'name', 'summary', 'space'])
print(df)
for row in df.iterrows():
    print(row[1].to_list())
which gives this output:
[1, 'url_a', 'scrap_a', 'name_a', 'summary_a', 'space_a']
[2, 'url_b', 'scrap_b', 'name_b', 'summary_b', 'space_b']
[3, 'url_c', 'scrap_c', 'name_c', 'summary_c', 'space_ac']
You can use
for row in df[['name', 'summary', 'space']].iterrows():
to iterate over specific columns only.
Like this:
In [245]: res = df.groupby('col_group', as_index=False)['col_name'].apply(list)
In [248]: res.tolist()
Out[248]: [['id', 'listing_url', 'scrape_id'], ['name', 'summary', 'space']]
my_vars = df.groupby('col_group').agg(list)['col_name'].to_dict()
Output:
>>> my_vars
{'Text': ['name', 'summary', 'space'], 'metadata': ['id', 'listing_url', 'scrape_id']}
The recommended usage would be just my_vars['Text'] to access the Text list, and so on. If you must have these as distinct names, you can force them upon your target scope, e.g. globals:
globals().update(df.groupby('col_group').agg(list)['col_name'].to_dict())
Result:
>>> Text
['name', 'summary', 'space']
>>> metadata
['id', 'listing_url', 'scrape_id']
However, I would advise against that, as you might unwittingly overwrite some of your other objects, or they might not end up in the scope you need (e.g. locals).

How do I split a names column in a pandas data frame if only some of the names have middle names?

I am working with a pandas data frame of names, and there are a few different formats of names. Some are 'first' 'last', others are 'first' 'middle' 'last', and others are 'first initial' 'second initial' 'last'. I would like to split these into three columns by using the strings. I am currently trying to use the split function, but I am getting "ValueError: Columns must be same length as key" because some names will split into two columns and others into three. How can I get around this?
df = pd.DataFrame({'name': ['bradley efron', 'c arden pope', 'a l smith']})
df[['First', 'Middle', 'Last']] = df['name'].str.split(" ", expand=True)
Here is a workaround:
import pandas as pd

list_of_names = ['bradley efron', 'c arden pope', 'a l smith']
new_list = []
for name in list_of_names:
    new_list.append(name.split(" "))
print(new_list)
for name in new_list:
    if len(name) == 2:
        name.insert(1, " ")
print(new_list)
df = pd.DataFrame.from_records(new_list).T
df.index = ["first name", "middle name", "last name"]
df = df.T
print(df)
Output:
  first name middle name last name
0    bradley                 efron
1          c       arden      pope
2          a           l     smith
There's probably a better way to go about this, but here's what I've got:
import numpy as np
import pandas as pd

df = {'name': ['bradley efron', 'c arden pope', 'a l smith']}
df = pd.DataFrame(df)
df = df['name'].str.split(' ', expand=True)
df.columns = ['first', 'middle', 'last']
df['last'] = np.where(df['last'].isnull(), df['middle'], df['last'])
df['middle'] = np.where(df['middle'] == df['last'], '', df['middle'])
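Another way to sidestep the unequal column counts (a sketch along the same lines, not taken from the answers above): split each name into tokens, then take the first token, the last token, and whatever sits in between as the middle name.

```python
import pandas as pd

df = pd.DataFrame({'name': ['bradley efron', 'c arden pope', 'a l smith']})
parts = df['name'].str.split()                 # list of tokens per row
df['first'] = parts.str[0]                     # first token
df['middle'] = parts.str[1:-1].str.join(' ')   # empty string when no middle name
df['last'] = parts.str[-1]                     # last token
print(df[['first', 'middle', 'last']])
```

This also copes with names that have more than one middle token.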

Create multiple dataframes in loop

I have a list, with each entry being a company name
companies = ['AA', 'AAPL', 'BA', ....., 'YHOO']
I want to create a new dataframe for each entry in the list.
Something like
(pseudocode)
for c in companies:
    c = pd.DataFrame()
I have searched for a way to do this but can't find it. Any ideas?
Just to underline my comment on @maxymoo's answer: it's almost invariably a bad idea ("code smell") to add names dynamically to a Python namespace. There are a number of reasons, the most salient being:
Created names might easily conflict with variables already used by your logic.
Since the names are dynamically created, you typically also end up using dynamic techniques to retrieve the data.
This is why dicts were included in the language. The correct way to proceed is:
d = {}
for name in companies:
    d[name] = pd.DataFrame()
Nowadays you can write a single dict comprehension expression to do the same thing, but some people find it less readable:
d = {name: pd.DataFrame() for name in companies}
Once d is created the DataFrame for company x can be retrieved as d[x], so you can look up a specific company quite easily. To operate on all companies you would typically use a loop like:
for name, df in d.items():
    # operate on DataFrame 'df' for company 'name'
In Python 2 you are better writing
for name, df in d.iteritems():
because this avoids instantiating a list of (name, df) tuples.
You can do this (although obviously use exec with extreme caution if this is going to be public-facing code):
for c in companies:
    exec('{} = pd.DataFrame()'.format(c))
Adding to the great answers above: they work flawlessly if you need to create empty DataFrames, but if you need to create multiple DataFrames based on some filtering, read on.
Suppose the list you have is a column of some bigger DataFrame, and you want to make a separate DataFrame for each unique company in it.
First take the unique names of the companies:
compuniquenames = df.company.unique()
Create a dictionary to store your DataFrames:
companydict = {elem: pd.DataFrame() for elem in compuniquenames}
Then fill each entry with its matching rows:
for key in companydict.keys():
    companydict[key] = df[df.company == key]
The above gives you a DataFrame for every unique company, containing its matching records.
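For completeness, pandas can also do this split itself with groupby (a sketch with made-up data), which avoids filtering the big DataFrame once per company:

```python
import pandas as pd

df = pd.DataFrame({'company': ['AA', 'AA', 'BA'],
                   'price':   [10, 11, 200]})

# one sub-DataFrame per unique company, built in a single pass
frames = {name: group for name, group in df.groupby('company')}
print(frames['AA'])   # just the 'AA' rows
```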
Below is the code for dynamically creating data frames in loop:
companies = ['AA', 'AAPL', 'BA', ....., 'YHOO']
for eachCompany in companies:
    # dynamically create DataFrames
    vars()[eachCompany] = pd.DataFrame()
For difference between vars(),locals() and globals() refer to the below link:
What's the difference between globals(), locals(), and vars()?
You can do it this way:
for xxx in yyy:
    globals()[f'dataframe_{xxx}'] = pd.DataFrame(xxx)
The following is reproducible, so let's say you have a list with the df/company names:
companies = ['AA', 'AAPL', 'BA', 'YHOO']
You probably also have data, presumably also a list (or rather a list of lists), like:
content_of_lists = [
    [['a', '1'], ['b', '2']],
    [['c', '3'], ['d', '4']],
    [['e', '5'], ['f', '6']],
    [['g', '7'], ['h', '8']]
]
In this special example the DataFrames should probably look very much alike, so this does not need to be very complicated:
dic = {}
for n, m in zip(companies, range(len(content_of_lists))):
    dic["df_{}".format(n)] = pd.DataFrame(content_of_lists[m]).rename(columns={0: "col_1", 1: "col_2"})
Here you would have to use dic["df_AA"] to get to the dataframe inside the dictionary.
But should you require more "distinct" naming of the DataFrames, I think you would have to use, for example, if-conditions:
dic = {}
for n, m in zip(companies, range(len(content_of_lists))):
    if n == 'AA':
        special_naming_1 = pd.DataFrame(content_of_lists[m]).rename(columns={0: "col_1", 1: "col_2"})
    elif n == 'AAPL':
        special_naming_2 ...
It is a little more effort, but it allows you to grab the DataFrame object in a more conventional way, by just writing special_naming_1 instead of dic['df_AA'], and it gives you more control over the DataFrame and column names if that's important.

Turn Python list from CSV with multi-value fields into a Python nested list, sort nested list values and export to CSV

I have used the Python csv module to turn a csv with multi-value fields into a Python list. The output contains fields with multiple values that are related.
['Route', 'Vehicles', 'Vehicle Class', 'Driver_ID', 'Date', 'Start', 'Arrive']
['ABC', 'ZYG098, AB0134, GF0158', 'A1, B2, C3', 'John Doe, Jane Doe, Abraham Lincoln', '20150301', 'A', 'B']
['AC', 'ZGA123', 'C3', 'George Washington', '20150301', 'A', 'C']
['ABC', 'XAZ012, AB0134, YZ089', 'C1, B2, A2 ', 'John Adams, Jane Doe, Thomas Jefferson', '20150302', 'A', 'B']
I would like to turn the Vehicles, Vehicle Class and Driver_ID fields into nested lists so that I can sort each sub-list within Vehicles (row[1]) to ensure the vehicles always appear in alphabetical order, while the Vehicle Class and Driver_ID entries keep their respective, correct positions relative to their vehicle. So the header and first rows would be arranged like:
['Route', 'Vehicles', 'Vehicle Class', 'Driver_ID', 'Date', 'Start', 'Arrive']
['ABC', 'AB0134, GF0158, ZYG098', 'B2, C3, A1', 'Jane Doe, Abraham Lincoln, John Doe', '20150301', 'A', 'B']
['AC', 'ZGA123', 'C3', 'George Washington', '20150301', 'A', 'C']
['ABC', 'AB0134, YZ089, XAZ012', 'B2, A2, C1', 'Jane Doe, Thomas Jefferson, John Adams', '20150302', 'A', 'B']
So in the output above each of the sub-groups/lists for Vehicles is sorted alphabetically and the Vehicle Class and Driver_ID are re-arranged as necessary to retain their original relationship with their respective Vehicles (i.e. Driver ID - John Doe drove Vehicle - ZYG098 which was Vehicle Class - A1, so those items are moved in their sub-lists to reflect that ZYG098 is now last, not first). If this can be done, how would you export the resulting nested list back to a CSV with the original headers?
Apologies if this is simple or ridiculous, I am just starting to learn Python. If a nested list is not the best option, I am open to any other solution (for a dictionary, I would need to join fields to create a key, as there is no unique key without combining Route_Date). If anyone has a solid resource for handling a wide range of CSV use cases with Python a recommendation would be great.
Thank you in advance for your patience and assistance.
Finally on the same page; it took a bit of work, but this will do what you want:
from itertools import chain
import csv
l = [['Route', 'Vehicles', 'Vehicle Class', 'Driver_ID', 'Date', 'Start', 'Arrive'],
     ['ABC', 'ZYG098, AB0134, GF0158', 'A1, B2, C3', 'John Doe, Jane Doe, Abraham Lincoln', '20150301', 'A', 'B'],
     ['AC', 'ZGA123', 'C3', 'George Washington', '20150301', 'A', 'C'],
     ['ABC', 'XAZ012, AB0134, YZ089', 'C1, B2, A2 ', 'John Adams, Jane Doe, Thomas Jefferson', '20150302', 'A', 'B']]

# transpose the original list: rows become columns, columns become rows
it = zip(*l)
# get each column separately as an iterator so we can pop the first
# element off to get the headers efficiently
route, veh, veh_c, d_id, date, start, arrive = map(iter, it)
# get all headers to write later
headers = next(route), next(veh), next(veh_c), next(d_id), next(date), next(start), next(arrive)
srt_veh = []
key_inds = []
# sort the vehicle elements and keep a record of their old indexes
# so sub-elements in Vehicle Class and Driver_ID can be rearranged to match
for x in veh:
    srt = sorted(x.split(","))
    key_inds.append([x.split(",").index(w) for w in srt])
    srt_veh.append(",".join(srt).strip())
srt_veh_cls = []
# reorder vehicle classes based on the old index of the elements in Vehicles,
# rejoining the split elements
for ind, ele in enumerate(veh_c):
    spl = ele.split(",")
    srt_veh_cls.append(",".join([spl[i].strip() for i in key_inds[ind]]))
srt_dr_id = []
# reorder driver IDs based on the old index of the elements in Vehicles,
# joining the sub-elements again after splitting and sorting
for ind, ele in enumerate(d_id):
    spl = ele.split(",")
    srt_dr_id.append(",".join([spl[i].strip() for i in key_inds[ind]]))
# transpose again for writing
zipped = zip(*(route, srt_veh, srt_veh_cls,
               srt_dr_id, date, start, arrive))
Finally, write with csv.writerows:
with open("out.csv", "w") as f:
    wr = csv.writer(f)
    wr.writerow(headers)
    wr.writerows(zipped)
Output:
Route,Vehicles,Vehicle Class,Driver_ID,Date,Start,Arrive
ABC,"AB0134, GF0158,ZYG098","B2,C3,A1","Jane Doe,Abraham Lincoln,John Doe",20150301,A,B
AC,ZGA123,C3,George Washington,20150301,A,C
ABC,"AB0134, YZ089,XAZ012","B2,A2,C1","Jane Doe,Thomas Jefferson,John Adams",20150302,A,B
For python 2 replace zip with itertools.izip and map with itertools.imap:
from itertools import izip, imap
You could zip more and do a few things to shorten the code, but I think that would not help the readability.
To convert to something like nested format you describe:
nested = list(zip(*lst))
And zip is its own inverse:
orig = list(zip(*nested))
But maybe what you really want is:
import operator
sort = sorted(lst[1:], key=operator.itemgetter(1))
This gives you a new list sorted by the second field (Vehicles). In this case you haven't changed the format of the data, so you should be able to dump it back out as CSV without modification, although you'd need to prepend the original headers from lst[0].
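As a further sketch of the zip trick applied to the original requirement: the three parallel comma-separated fields of a row can be sorted together with one zip/sort/unzip round trip (shown here on the first data row only, which has no stray whitespace):

```python
row = ['ABC', 'ZYG098, AB0134, GF0158', 'A1, B2, C3',
       'John Doe, Jane Doe, Abraham Lincoln', '20150301', 'A', 'B']

# split the three linked fields, sort the triples by vehicle, rejoin
veh, cls, drv = (field.split(', ') for field in row[1:4])
veh, cls, drv = zip(*sorted(zip(veh, cls, drv)))
row[1:4] = [', '.join(veh), ', '.join(cls), ', '.join(drv)]

print(row[1])   # AB0134, GF0158, ZYG098
```

Sorting tuples keeps the class and driver aligned with their vehicle automatically.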

Best representation of a database table on memory and ways to query by line on python?

I have an XLS table I am parsing with Python, and currently I am representing it as a hash whose keys are (line_number, column_number) tuples; in practice I am using another identifier, but for the sake of simplicity this will do to explain the situation.
Now, say that I am interested on obtaining a given line of this table, I am using:
for k, v in self.table.iteritems():
    if k[0] == '1':
        # do whatever
In other words, I am visiting every cell and checking its line number, which doesn't seem like the best way to do this. I am not sure at this point whether using a hash with partial keys would be the best way either. Of course, using a database would probably be the best option, but given that my dataset is very small, I would rather keep the program simple and just work with it in memory. Thus my question narrows down to:
Is there a better way to represent a table on memory on python?
How would I go about obtaining a line out of it? (Given the best representation)
I hope the question is pertinent. I wasn't able to find the appropriate keywords to search around this problem.
Thank you.
from collections import defaultdict

columns = ['col1', 'col2']
table = [
    ['abc', 123],
    ['def', 456],
    ['ghi', 123],
]
view = dict((name, defaultdict(list)) for name in columns)
for row in table:
    for i, col in enumerate(row):
        name = columns[i]
        view[name][col].append(row)
print(view['col1']['abc'])
print(view['col2'][123])
I would modify Dugres' answer in order to provide access to the table by row, to use memory more efficiently and to try to make more natural use of the dict class.
In this example, the row keys are from a simple numeric enumeration, counting from 0.
columns = ['col1', 'col2']
table = [
    ['abc', 123],
    ['def', 456],
    ['ghi', 123],
]
# build view from a comprehension
view = {i: {columns[j]: table[i][j] for j in range(len(columns))} for i in range(len(table))}
# build view procedurally
view = dict()
for i, row in enumerate(table):
    view[i] = dict()
    for j, col in enumerate(row):
        view[i][columns[j]] = col
# view contents
view
{0: {'col2': 123, 'col1': 'abc'},
1: {'col2': 456, 'col1': 'def'},
2: {'col2': 123, 'col1': 'ghi'}}
# Cell by row and column
>>> view[0]['col1']
'abc'
# List of cells for row 0:
>>> [view[0][col] for col in columns]
['abc', 123]
# All cells in col2:
>>> [view[row]['col2'] for row in sorted(view.keys())]
[123, 456, 123]
# All rows with value 123 in col2, using a list comprehension
>>> [[view[row][c] for c in columns] for row in view if view[row]['col2'] == 123]
[['abc', 123], ['ghi', 123]]
# All rows with value 123 in col2, using a generator function
def rowGenerator(view, col, value):
    for row in view.keys():
        if view[row][col] == value:
            yield [view[row][colName] for colName in columns]

>>> [row for row in rowGenerator(view, 'col2', 123)]
[['abc', 123], ['ghi', 123]]
I've stored in-memory tables using arrays & lists from which you can call specific items based on position:
array = [
    ['John', 19, 'male'],
    ['Sara', 22, 'female']
]
Have you considered that?
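Since the rest of this page leans on pandas anyway, a DataFrame is another natural in-memory representation for a small table (a sketch with toy data; the column names are made up):

```python
import pandas as pd

table = pd.DataFrame([['abc', 123], ['def', 456], ['ghi', 123]],
                     columns=['col1', 'col2'])

print(table.loc[1])                  # one whole row, by label
print(table[table['col2'] == 123])   # every row matching a value
```

This gives you row access and column-value lookups without building index dictionaries by hand.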
