I have a list, with each entry being a company name
companies = ['AA', 'AAPL', 'BA', ....., 'YHOO']
I want to create a new dataframe for each entry in the list.
Something like
(pseudocode)
for c in companies:
c = pd.DataFrame()
I have searched for a way to do this but can't find it. Any ideas?
Just to underline my comment on @maxymoo's answer: it's almost invariably a bad idea ("code smell") to add names dynamically to a Python namespace. There are a number of reasons, the most salient being:
Created names might easily conflict with variables already used by your logic.
Since the names are dynamically created, you typically also end up using dynamic techniques to retrieve the data.
This is why dicts were included in the language. The correct way to proceed is:
d = {}
for name in companies:
d[name] = pd.DataFrame()
Nowadays you can write a single dict comprehension expression to do the same thing, but some people find it less readable:
d = {name: pd.DataFrame() for name in companies}
Once d is created the DataFrame for company x can be retrieved as d[x], so you can look up a specific company quite easily. To operate on all companies you would typically use a loop like:
for name, df in d.items():
# operate on DataFrame 'df' for company 'name'
In Python 2 you are better off writing
for name, df in d.iteritems():
because this avoids instantiating a list of (name, df) tuples.
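For example, a minimal lookup sketch (assuming 'AAPL' is one of the names in companies):
aapl_df = d['AAPL']  # the DataFrame stored under 'AAPL'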
You can do this (although obviously use exec with extreme caution if this is going to be public-facing code):
for c in companies:
exec('{} = pd.DataFrame()'.format(c))
Adding to the great answers above. The above will work flawlessly if you need to create empty data frames, but if you need to create multiple dataframes based on some filtering:
Suppose the list you got is a column of some dataframe, and you want to make a data frame for each unique company in the bigger data frame.
First take the unique names of the companies:
compuniquenames = df.company.unique()
Create a data frame dictionary to store your data frames:
companydict = {elem : pd.DataFrame() for elem in compuniquenames}
The above two steps are already in the post; now loop over the keys and filter the big data frame:
for key in companydict.keys():
    companydict[key] = df[df.company == key]
The above will give you a data frame for each unique company, containing its matching records.
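As an aside (not part of the original answer), the same dictionary of per-company frames can be built in one line with groupby; a minimal sketch, assuming df has a company column:
companydict = {name: group for name, group in df.groupby('company')}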
Below is the code for dynamically creating data frames in a loop:
companies = ['AA', 'AAPL', 'BA', ....., 'YHOO']
for eachCompany in companies:
#Dynamically create Data frames
vars()[eachCompany] = pd.DataFrame()
For the difference between vars(), locals() and globals(), refer to the link below:
What's the difference between globals(), locals(), and vars()?
You can do it this way:
for xxx in yyy:
    globals()[f'dataframe_{xxx}'] = pd.DataFrame(xxx)
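Note that reading such a frame back also requires the dynamic name, which is part of why a plain dict is usually simpler; a hypothetical lookup, assuming one of the xxx values was 'AA':
df_aa = globals()['dataframe_AA']  # dynamic lookup by constructed name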
The following is reproducible -> so let's say you have a list with the df/company names:
companies = ['AA', 'AAPL', 'BA', 'YHOO']
You probably also have data, presumably also a list (or rather a list of lists), like:
content_of_lists = [
[['a', '1'], ['b', '2']],
[['c', '3'], ['d', '4']],
[['e', '5'], ['f', '6']],
[['g', '7'], ['h', '8']]
]
In this special example the dfs should probably look very much alike, so this does not need to be very complicated:
dic = {}
for n, m in zip(companies, range(len(content_of_lists))):
    dic["df_{}".format(n)] = pd.DataFrame(content_of_lists[m]).rename(columns={0: "col_1", 1: "col_2"})
Here you would have to use dic["df_AA"] to get to the dataframe inside the dictionary.
But should you require more "distinct" naming of the dataframes, I think you would have to use, for example, if-conditions, like:
dic = {}
for n, m in zip(companies, range(len(content_of_lists))):
    if n == 'AA':
        special_naming_1 = pd.DataFrame(content_of_lists[m]).rename(columns={0: "col_1", 1: "col_2"})
    elif n == 'AAPL':
        special_naming_2 ...
It is a little more effort, but it allows you to grab the dataframe object in a more conventional way, by just writing special_naming_1 instead of dic['df_AA'], and gives you more control over the dataframes' names and column names if that's important.
I've been searching everywhere for a tip, but can't seem to find an answer.
I am trying to show items which have the same type.
i.e. here's my dataset
What I want to end up with is a list of "Names" which are both a book and a movie; i.e. the output should be a list with the "Name" column only, showing the two items "Harry Potter" and "LoTR":
I was thinking of doing a pivot, but not sure where to go from there.
# cross-tabulate name against type, then keep the names flagged as both book and movie
ct = pd.crosstab(df["name"], df["type"]).astype(bool)
result = ct.index[ct["book"] & ct["movie"]].to_list()
Please try this:
df_new = df[['Name','Type']].value_counts().reset_index()['Name'].value_counts().reset_index()
names = list(df_new[df_new['Name']>1]['index'].unique())
The above code gives all names with more than one type. If you want exactly the names with two types, change the 2nd line to this:
names = list(df_new[df_new['Name']==2]['index'].unique())
You can use set intersection:
>>> list(set(df.loc[df['Type'] == 'Movie', 'Name']) \
.intersection(df.loc[df['Type'] == 'Book', 'Name']))
['Harry Potter', 'LoTR']
Or
>>> df.loc[df['Type'] == 'Movie', 'Name'] \
.loc[lambda x: x.isin(df.loc[df['Type'] == 'Book', 'Name'])].tolist()
['Harry Potter', 'LoTR']
I have a pandas DataFrame
temp = pd.DataFrame({'country':['C1','C1','C1','C1','C2','C2','C2','C2'],
'seg': ['S1','S2','S1','S2','S1','S2','S1','S2'],
'agegroup': ['1', '2', '2', '1','1', '2', '2', '1'],
'N' : [21,22,23,24,31,32,33,34]})
and a vector like
vector = ['country', 'seg']
What I want to do is create two vectors named vector_country and vector_seg, which will contain the respective columns of temp, in this case the columns country and seg.
I have tried
for vec in vector:
'vector_' + str(vec) = temp[[vec]]
So I would like to end up with two vectors:
vector_country, which will contain temp.country, and
vector_seg, which will contain temp.seg.
Is it possible to do something like that in Python?
Do not try to dynamically name variables. This is bad practice and will make your code intractable.
A better alternative is to use dictionaries, as so:
v = {}
for vec in ['country', 'seg']:
v[vec] = temp[vec].values
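Each "vector" can then be retrieved by key; a minimal usage sketch:
country_values = v['country']  # the values from temp['country']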
I am working on loading a dataset from a pickle file like this:
""" Load the dictionary containing the dataset """
with open("final_project_dataset.pkl", "r") as data_file:
data_dict = pickle.load(data_file)
It works fine and loads the data correctly. This is an example of one row:
'GLISAN JR BEN F': {'salary': 274975, 'to_messages': 873, 'deferral_payments': 'NaN', 'total_payments': 1272284, 'exercised_stock_options': 384728, 'bonus': 600000, 'restricted_stock': 393818, 'shared_receipt_with_poi': 874, 'restricted_stock_deferred': 'NaN', 'total_stock_value': 778546, 'expenses': 125978, 'loan_advances': 'NaN', 'from_messages': 16, 'other': 200308, 'from_this_person_to_poi': 6, 'poi': True, 'director_fees': 'NaN', 'deferred_income': 'NaN', 'long_term_incentive': 71023, 'email_address': 'ben.glisan@enron.com', 'from_poi_to_this_person': 52}
Now, how can I get the number of features, e.g. (salary, to_messages, ...., from_poi_to_this_person)?
I got this row by printing the whole dataset (print data_dict), and this is one of the results. I want to know how many features there are in general, i.e. in the whole dataset, without specifying a key in the dictionary.
Thanks
Try this.
no_of_features = len(data_dict[data_dict.keys()[0]])
This will work only if all the keys in data_dict have the same number of features.
or simply
no_of_features = len(data_dict['GLISAN JR BEN F'])
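Note that data_dict.keys()[0] only works in Python 2, where keys() returns a list. On Python 3, a sketch of the equivalent would be:
# dict views aren't indexable in Python 3, so take the first key with next(iter(...))
no_of_features = len(data_dict[next(iter(data_dict))])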
""" Load the dictionary containing the dataset """
with open("final_project_dataset.pkl", "r") as data_file:
data_dict = pickle.load(data_file)
print len(data_dict)
I think you want to find out the size of the set of all unique field names used in the row dictionaries. You can find that like this:
data_dict = {
'red':{'alpha':1,'bravo':2,'golf':3,'kilo':4},
'green':{'bravo':1,'delta':2,'echo':3},
'blue':{'foxtrot':1,'tango':2}
}
unique_features = set(
feature
for row_dict in data_dict.values()
for feature in row_dict.keys()
)
print(unique_features)
# {'golf', 'delta', 'foxtrot', 'alpha', 'bravo', 'echo', 'tango', 'kilo'}
print(len(unique_features))
# 8
Apply sum to the len of each nested dictionary:
sum(len(v) for _, v in data_dict.items())
v represents a nested dictionary object, and calling len on a dictionary returns its number of keys, viz. the number of features.
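For instance:
len({'salary': 274975, 'bonus': 600000})  # 2 keys -> 2 features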
If the features may be duplicated across nested objects, then collect them in a set and apply len
len(set(f for v in data_dict.values() for f in v.keys()))
Here is the answer
https://discussions.udacity.com/t/lesson-5-number-of-features/44253/4
where we choose one person, in this case SKILLING JEFFREY K, within the database called enron_data, and then print the length of the keys in the dictionary:
print len(enron_data["SKILLING JEFFREY K"].keys())
I have what should be a simple problem, but three hours into trying different things I can't solve it.
I have pymysql returning results from a query. I can't share the exact example, but this straw man should do:
cur.execute("select name, address, phonenum from contacts")
This returns results perfectly, which I grab with
results = cur.fetchall()
and then convert to a list object, exactly as I want it:
data = list(results)
Unfortunately this doesn't include the header, but you can get it with cur.description (which contains metadata including, but not limited to, the header). I push this into a list:
header = []
for n in cur.description:
    header.append(str(n[0]))
so my header looks like:
['name','address','phonenum']
and my results look like:
[['Tom','dublin','12345'],['Bob','Kerry','56789']]
I want to create a dataframe in pandas and then pivot it, but it needs column headers to work properly. I had previously been importing a completed csv into a pandas DF, which included the header, so this all worked smoothly. Now I need to get this data directly from the DB, so I thought: that's easy, I just join the two lists and hey presto, I have what I am looking for. But when I try to append, I actually wind up with this:
['name','address','phonenum',['Tom','dublin','12345'],['Bob','Kerry','56789']]
when i need this
[['name','address','phonenum'],['Tom','dublin','12345'],['Bob','Kerry','56789']]
Anyone any ideas?
Much appreciated!
Addition of lists concatenates contents:
In [17]: [1] + [2,3]
Out[17]: [1, 2, 3]
This is true even if the contents are themselves lists:
In [18]: [[1]] + [[2],[3]]
Out[18]: [[1], [2], [3]]
So:
In [13]: header = ['name','address','phonenum']
In [14]: data = [['Tom','dublin','12345'],['Bob','Kerry','56789']]
In [15]: [header] + data
Out[15]:
[['name', 'address', 'phonenum'],
['Tom', 'dublin', '12345'],
['Bob', 'Kerry', '56789']]
In [16]: pd.DataFrame(data, columns=header)
Out[16]:
name address phonenum
0 Tom dublin 12345
1 Bob Kerry 56789
Note that loading a DataFrame with data from a database can also be done with pandas.read_sql.
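For example, a minimal sketch, assuming conn is the open pymysql connection the cursor came from:
import pandas as pd
# read_sql runs the query and builds the DataFrame, headers included
df = pd.read_sql("select name, address, phonenum from contacts", conn)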
Is that what you are looking for?
first = ['name','address','phonenum']
second = [['Tom','dublin','12345'],['Bob','Kerry','56789']]
second = [first] + second
print second
[['name', 'address', 'phonenum'], ['Tom', 'dublin', '12345'], ['Bob', 'Kerry', '56789']]
Other possibilities:
You could insert it into data at location 0 as a list:
header = ['name','address','phonenum']
data = [['Tom','dublin','12345'],['Bob','Kerry','56789']]
data.insert(0,header)
print data
[['name', 'address', 'phonenum'], ['Tom', 'dublin', '12345'], ['Bob', 'Kerry', '56789']]
But if you are going to manipulate the header variable, you can shallow-copy it:
header = ['name','address','phonenum']
data = [['Tom','dublin','12345'],['Bob','Kerry','56789']]
data.insert(0,header[:])
print data
[['name', 'address', 'phonenum'], ['Tom', 'dublin', '12345'], ['Bob', 'Kerry', '56789']]
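A quick demonstration (not in the original answer) of why the copy matters: without the copy, both names point at the same list, so later changes to header also show up inside data.
header = ['name', 'address', 'phonenum']
data = [['Tom', 'dublin', '12345']]
data.insert(0, header)  # no copy: data[0] *is* header
header.append('email')
print(data[0])  # ['name', 'address', 'phonenum', 'email']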
I have used the Python csv module to turn a csv with multi-value fields into a Python list. The output contains fields with multiple values that are related.
['Route', 'Vehicles', 'Vehicle Class', 'Driver_ID', 'Date', 'Start', 'Arrive']
['ABC', 'ZYG098, AB0134, GF0158', 'A1, B2, C3', 'John Doe, Jane Doe, Abraham Lincoln', '20150301', 'A', 'B']
['AC', 'ZGA123', 'C3', 'George Washington', '20150301', 'A', 'C']
['ABC', 'XAZ012, AB0134, YZ089', 'C1, B2, A2 ', 'John Adams, Jane Doe, Thomas Jefferson', '20150302', 'A', 'B']
I would like to turn the Vehicles, Vehicle Class and Driver_ID fields into nested lists, so that I can sort each sub-list within Vehicles (row[1]) to ensure the vehicles always appear in alphabetical order, while the Vehicle Class and Driver_ID entries are rearranged to stay in their respective, correct orders. So the header and first row sub-lists would be arranged like:
['Route', 'Vehicles', 'Vehicle Class', 'Driver_ID', 'Date', 'Start', 'Arrive']
['ABC', 'AB0134, GF0158, ZYG098', 'B2, C3, A1', 'Jane Doe, Abraham Lincoln, John Doe', '20150301', 'A', 'B']
['AC', 'ZGA123', 'C3', 'George Washington', '20150301', 'A', 'C']
['ABC', 'AB0134, XAZ012, YZ089', 'B2, C1, A2', 'Jane Doe, John Adams, Thomas Jefferson', '20150302', 'A', 'B']
So in the output above, each of the sub-groups/lists for Vehicles is sorted alphabetically, and the Vehicle Class and Driver_ID entries are rearranged as necessary to retain their original relationship with their respective Vehicles (i.e. Driver_ID John Doe drove Vehicle ZYG098, which was Vehicle Class A1, so those items are moved in their sub-lists to reflect that ZYG098 is now last, not first). If this can be done, how would you export the resulting nested list back to a CSV with the original headers?
Apologies if this is simple or ridiculous; I am just starting to learn Python. If a nested list is not the best option, I am open to any other solution (for a dictionary, I would need to join fields to create a key, as there is no unique key without combining Route_Date). If anyone has a solid resource for handling a wide range of CSV use cases with Python, a recommendation would be great.
Thank you in advance for your patience and assistance.
Finally on the same page. It took a bit of work, but this will do what you want:
import csv
l = [['Route', 'Vehicles', 'Vehicle Class', 'Driver_ID', 'Date', 'Start', 'Arrive'],
['ABC', 'ZYG098, AB0134, GF0158', 'A1, B2, C3', 'John Doe, Jane Doe, Abraham Lincoln', '20150301', 'A', 'B'],
['AC', 'ZGA123', 'C3', 'George Washington', '20150301', 'A', 'C'],
['ABC', 'XAZ012, AB0134, YZ089', 'C1, B2, A2 ', 'John Adams, Jane Doe, Thomas Jefferson', '20150302', 'A', 'B']]
# transpose the original list (rows become columns) and wrap each column
# in iter() so the header can be popped off efficiently with next()
it = zip(*l)
route, veh, veh_c, d_id, date, start, arrive = (iter(col) for col in it)
# get all headers to write later
headers = next(route), next(veh), next(veh_c), next(d_id), next(date), next(start), next(arrive)
srt_veh = []
key_inds = []
# sort vehicle elements and keep a record of old indexes
# so subelements in Vehicle_class and driver_id can be rearranged to match
for x in veh:
    # strip stray whitespace before sorting so the order is truly alphabetical
    parts = [w.strip() for w in x.split(",")]
    srt = sorted(parts)
    key_inds.append([parts.index(w) for w in srt])
    srt_veh.append(",".join(srt))
srt_veh_cls = []
# sort vehicle class based on old index of elements in vehicles
# and rejoin split elements
for ind, ele in enumerate(veh_c):
spl = ele.split(",")
srt_veh_cls.append(",".join([spl[i].strip() for i in key_inds[ind]]))
srt_dr_id = []
# sort driver_ids based on old index of elements in vehicle
# and join subelements again after splitting and sorting
for ind, ele in enumerate(d_id):
spl = ele.split(",")
srt_dr_id.append(",".join([spl[i].strip() for i in key_inds[ind]]))
# transpose again for writing
zipped = zip(*(route, srt_veh, srt_veh_cls,
srt_dr_id, date, start, arrive))
Finally, write with csv.writerows:
with open("out.csv", "w", newline="") as f:
wr = csv.writer(f)
wr.writerow(headers)
wr.writerows(zipped)
Output:
Route,Vehicles,Vehicle Class,Driver_ID,Date,Start,Arrive
ABC,"AB0134,GF0158,ZYG098","B2,C3,A1","Jane Doe,Abraham Lincoln,John Doe",20150301,A,B
AC,ZGA123,C3,George Washington,20150301,A,C
ABC,"AB0134,XAZ012,YZ089","B2,C1,A2","Jane Doe,John Adams,Thomas Jefferson",20150302,A,B
For Python 2, replace zip with itertools.izip:
from itertools import izip
You could zip more and do a few things to shorten the code, but I think that would not help readability.
To convert to something like the nested format you describe:
nested = zip(*lst)
And zip is its own inverse:
orig = zip(*nested)
But maybe what you really want is:
import operator
sort = sorted(lst[1:], key=operator.itemgetter(1))
This gives you a new list sorted by column 1 (the Vehicles field). In this case you haven't changed the format of the data, so you should be able to dump it back out as csv without modification, although you'd need to prepend the original headers from lst[0].
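Putting that together, a minimal sketch of the export step (assuming lst is the full parsed list, header row included, and "sorted.csv" is a hypothetical output name):
import csv, operator
sort = sorted(lst[1:], key=operator.itemgetter(1))
with open("sorted.csv", "w", newline="") as f:
    wr = csv.writer(f)
    wr.writerow(lst[0])  # put the original header row back on top
    wr.writerows(sort)   # the rows sorted by the Vehicles column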