I've been searching everywhere for a tip, but I can't seem to find an answer.
I am trying to find the items that appear with more than one type.
i.e. here's my dataset
What I want to end up with is a list of "Names" which are both a book and a movie.
i.e. the output should be "Harry Potter" and "LoTR": a list with only the "Name" column, showing those two items.
I was thinking of doing a pivot, but not sure where to go from there.
You can cross-tabulate Name against Type and keep the names flagged for both:
ct = pd.crosstab(df["Name"], df["Type"]).astype(bool)
result = ct.index[ct["Book"] & ct["Movie"]].to_list()
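Since the original dataset isn't shown, here is a minimal runnable sketch of the crosstab approach; the sample rows below are assumed from the description in the question:

```python
import pandas as pd

# Assumed sample data modeled on the question's description
df = pd.DataFrame({
    "Name": ["Harry Potter", "Harry Potter", "LoTR", "LoTR", "Dune"],
    "Type": ["Book", "Movie", "Book", "Movie", "Book"],
})

# Cross-tabulate names against types, then keep names present as both
ct = pd.crosstab(df["Name"], df["Type"]).astype(bool)
result = ct.index[ct["Book"] & ct["Movie"]].to_list()
print(result)  # ['Harry Potter', 'LoTR']
```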
Please try this:
df_new = df[['Name','Type']].value_counts().reset_index()['Name'].value_counts().reset_index()
names = list(df_new[df_new['Name']>1]['index'].unique())
The above code gives all names with more than one type. If you want only the names with exactly two types, change the second line to this:
names = list(df_new[df_new['Name']==2]['index'].unique())
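A version-stable way to express the same count (the column labels produced by value_counts().reset_index() changed in pandas 2.0) is groupby/nunique; the sample data here is assumed, since the original dataset isn't shown:

```python
import pandas as pd

# Assumed sample data; the original dataset isn't shown in the question
df = pd.DataFrame({
    "Name": ["Harry Potter", "Harry Potter", "LoTR", "LoTR", "Dune"],
    "Type": ["Book", "Movie", "Book", "Movie", "Book"],
})

# Number of distinct types per name
type_counts = df.groupby("Name")["Type"].nunique()

# Names that appear with more than one type (use == 2 for exactly two)
names = type_counts[type_counts > 1].index.tolist()
print(names)  # ['Harry Potter', 'LoTR']
```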
You can use set intersection:
>>> list(set(df.loc[df['Type'] == 'Movie', 'Name']) \
.intersection(df.loc[df['Type'] == 'Book', 'Name']))
['Harry Potter', 'LoTR']
Or
>>> df.loc[df['Type'] == 'Movie', 'Name'] \
.loc[lambda x: x.isin(df.loc[df['Type'] == 'Book', 'Name'])].tolist()
['Harry Potter', 'LoTR']
I need to insert values from a list into the corresponding columns of a DataFrame (first value into the first column, second value into the second column, etc.), but I am unable to do it.
With the code below, the values are appended as rows, not placed across the columns.
list1 = pd.Series(list1)
pd.concat([df_sum, list1])
df_sum.append(list1) gives the same result.
So any suggestions?
Code for example:
df_sum = pd.DataFrame(columns = ['russian language', 'english language', 'mathemathics', 'geometry', 'algebra', 'computer science', 'history',
'geography', 'biology', 'social science', 'chemistry', 'physics'])
list1 = [28323,
29302,
9570,
22014,
18359,
20514,
25139,
24678,
23215,
20640,
19904,
21494]
You can achieve that with a plain list and .loc:
to_append = list(list1)    # make sure the values are a plain list, not a Series
df_length = len(df_sum)
df_sum.loc[df_length] = to_append    # write the values as one new row
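Put together with the frame and values from the question, the append looks like this:

```python
import pandas as pd

# The frame and values from the question
df_sum = pd.DataFrame(columns=['russian language', 'english language', 'mathemathics',
                               'geometry', 'algebra', 'computer science', 'history',
                               'geography', 'biology', 'social science', 'chemistry',
                               'physics'])
list1 = [28323, 29302, 9570, 22014, 18359, 20514,
         25139, 24678, 23215, 20640, 19904, 21494]

# Writing to .loc[len(df_sum)] appends the values as one row,
# aligned positionally with the columns
df_sum.loc[len(df_sum)] = list1
print(df_sum.shape)  # (1, 12)
```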
For a quick answer, I would use a dictionary, built column by column:
dict_temp = {}
for i, col in enumerate(df_sum.columns.values):
    dict_temp[col] = [list1[i]]
pd.DataFrame.from_dict(dict_temp)
Or do it like this. It depends on your future purposes; I guess you want to add more lists at some point?
for i, col in enumerate(df_sum.columns.values):
    df_sum.at[0, col] = list1[i]
But there might be a better solution.
I have a pandas series whose unique values are something like:
['toyota', 'toyouta', 'vokswagen', 'volkswagen', 'vw', 'volvo']
Now I want to fix some of these values like:
toyouta -> toyota
(Note that not all values have mistakes such as volvo, toyota etc)
I've tried making a dictionary where key is the correct word and value is the word to be corrected and then map that onto my series.
This is how my code looks:
corrections = {'maxda': 'mazda', 'porcshce': 'porsche', 'toyota': 'toyouta', 'vokswagen': 'vw', 'volkswagen': 'vw'}
df.brands = df.brands.map(corrections)
print(df.brands.unique())
>>> [nan, 'mazda', 'porsche', 'toyouta', 'vw']
As you can see the problem is that this way, all values not present in the dictionary are automatically converted to nan. One solution is to map all the correct values to themselves, but I was hoping there could be a better way to go about this.
Use:
df.brands = df.brands.map(corrections).fillna(df.brands)
Or:
df.brands = df.brands.map(lambda x: corrections.get(x, x))
Or:
df.brands = df.brands.replace(corrections)
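To illustrate the difference on data like the question's (the series below is reconstructed from the listed unique values, and the dict here maps misspelling to correction, so treat both as assumptions):

```python
import pandas as pd

# Series reconstructed from the unique values listed in the question
brands = pd.Series(['toyota', 'toyouta', 'vokswagen', 'volkswagen', 'vw', 'volvo'])

# Assumed corrections: misspelling -> correct spelling
corrections = {'toyouta': 'toyota', 'vokswagen': 'vw', 'volkswagen': 'vw'}

# map() alone would turn unmapped values into NaN; fillna restores them
fixed = brands.map(corrections).fillna(brands)
print(fixed.unique())  # ['toyota' 'vw' 'volvo']

# replace() touches only the listed keys, so no fillna is needed
print(brands.replace(corrections).equals(fixed))  # True
```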
I would like to reorder a list of strings (column headers from Pandas) in Python 2.7.13 based on a regular expression. The desired output will have the current 0 index item in the same place, followed immediately by the matched strings found using the regular expression, followed by the remaining strings.
# Here's the input list:
cols = ['ID', 'MP', 'FC', 'Dest_MP', 'Dest_FC', 'Origin_MP', 'Origin_FC']
# And the desired output:
output_cols = ['ID', 'FC', 'Dest_FC', 'Origin_FC', 'MP', 'Dest_MP', 'Origin_MP']
I have a working code example. It's not pretty, and that's why I'm here.
import re
cols = ['ID', 'MP', 'FC', 'Dest_MP', 'Dest_FC', 'Origin_MP', 'Origin_FC']
pattern = re.compile(r'^FC|FC$')
matched_cols = filter(pattern.search, cols)
indices = [0] + [cols.index(match_column) for match_column in matched_cols]
output_cols, counter = [], 0
for index in indices:
    output_cols.append(cols.pop(index - counter))
    counter += 1
output_cols += cols
print(output_cols)
Is there a more readable, more pythonic way to accomplish this?
Isolate the first element; there's no way around that.
Then, on the rest of the list, use a sort key which returns a pair:
- first priority: a boolean indicating whether the element matches the regex (negated, so matches appear first)
- second priority: the element itself, to tiebreak among matching/non-matching elements
like this:
import re
cols = ['ID', 'MP', 'FC', 'Dest_MP', 'Dest_FC', 'Origin_MP', 'Origin_FC']
new_cols = [cols[0]] + sorted(cols[1:], key=lambda x: (not bool(re.search("^FC|FC$", x)), x))
result:
['ID', 'Dest_FC', 'FC', 'Origin_FC', 'Dest_MP', 'MP', 'Origin_MP']
If you want FC to appear first, add another value to the returned key. Let's choose the length of the strings (it's not clear what you really want as a tiebreaker):
key=lambda x: (not bool(re.search("^FC|FC$", x)), len(x), x)
result is now:
['ID', 'FC', 'Dest_FC', 'Origin_FC', 'MP', 'Dest_MP', 'Origin_MP']
Note that sort is stable, so maybe you don't need a tiebreaker at all:
new_cols = [cols[0]] + sorted(cols[1:], key=lambda x: not bool(re.search("^FC|FC$", x)))
result:
['ID', 'FC', 'Dest_FC', 'Origin_FC', 'MP', 'Dest_MP', 'Origin_MP']
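Since the desired output keeps the original relative order within each group, the same result can also be written as a stable partition with two list comprehensions, avoiding the sort entirely:

```python
import re

cols = ['ID', 'MP', 'FC', 'Dest_MP', 'Dest_FC', 'Origin_MP', 'Origin_FC']
pattern = re.compile(r'^FC|FC$')

# Partition the tail of the list: matches first, then the rest,
# each group keeping its original order
rest = cols[1:]
new_cols = ([cols[0]]
            + [c for c in rest if pattern.search(c)]
            + [c for c in rest if not pattern.search(c)])
print(new_cols)
# ['ID', 'FC', 'Dest_FC', 'Origin_FC', 'MP', 'Dest_MP', 'Origin_MP']
```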
This seems like it should be easy, but I can't seem to find what I'm looking for. I have two lists of people (FirstName, LastName, Date of Birth), and I just want to know which people are in both lists, and which ones are in one but not the other.
I've tried something like
common = pd.merge(list1, list2, how='left', left_on=['Last', 'First', 'DOB'], right_on=['Patient Last Name', 'Patient First Name', 'Date of Birth']).dropna()
based on something else I found online, but it gives me this error:
KeyError: 'Date of Birth'
I've verified that that is indeed the column heading in the second list, so I don't get what's wrong. Has anyone done matching like this? What's the easiest/fastest way? The names may have different formatting between lists, like "Smith-Jones" vs. "SmithJones" vs. "Smith Jones", but I get around that by stripping all spaces and punctuation from the names. I assume that's a good first step?
Try this; it should work:
import pandas as pd
from io import StringIO  # on Python 2 this was `from StringIO import StringIO`

TESTDATA = StringIO("""DOB;First;Last
2016-07-26;John;smith
2016-07-27;Mathew;George
2016-07-28;Aryan;Singh
2016-07-29;Ella;Gayau
""")
list1 = pd.read_csv(TESTDATA, sep=";")

TESTDATA = StringIO("""Date of Birth;Patient First Name;Patient Last Name
2016-07-26;John;smith
2016-07-27;Mathew;XXX
2016-07-28;Aryan;Singh
2016-07-20;Ella;Gayau
""")
list2 = pd.read_csv(TESTDATA, sep=";")

print(list2)
print(list1)
common = pd.merge(list1, list2, how='left', left_on=['Last', 'First', 'DOB'], right_on=['Patient Last Name', 'Patient First Name', 'Date of Birth']).dropna()
print(common)
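The merge above finds the common rows. To also answer the "in one list but not the other" part, merge's indicator flag is useful; this sketch uses small frames with column names assumed to match the question:

```python
import pandas as pd

# Small frames mirroring the question's column names (assumed)
list1 = pd.DataFrame({
    'Last': ['smith', 'George', 'Singh', 'Gayau'],
    'First': ['John', 'Mathew', 'Aryan', 'Ella'],
    'DOB': ['2016-07-26', '2016-07-27', '2016-07-28', '2016-07-29'],
})
list2 = pd.DataFrame({
    'Patient Last Name': ['smith', 'XXX', 'Singh', 'Gayau'],
    'Patient First Name': ['John', 'Mathew', 'Aryan', 'Ella'],
    'Date of Birth': ['2016-07-26', '2016-07-27', '2016-07-28', '2016-07-20'],
})

# An outer merge with indicator=True labels each row as
# 'both', 'left_only', or 'right_only'
merged = pd.merge(
    list1, list2, how='outer', indicator=True,
    left_on=['Last', 'First', 'DOB'],
    right_on=['Patient Last Name', 'Patient First Name', 'Date of Birth'],
)

common = merged[merged['_merge'] == 'both']
only_in_list1 = merged[merged['_merge'] == 'left_only']
only_in_list2 = merged[merged['_merge'] == 'right_only']
print(len(common), len(only_in_list1), len(only_in_list2))  # 2 2 2
```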
I have a list, with each entry being a company name
companies = ['AA', 'AAPL', 'BA', ....., 'YHOO']
I want to create a new dataframe for each entry in the list.
Something like
(pseudocode)
for c in companies:
    c = pd.DataFrame()
I have searched for a way to do this but can't find it. Any ideas?
Just to underline my comment on @maxymoo's answer, it's almost invariably a bad idea ("code smell") to add names dynamically to a Python namespace. There are a number of reasons, the most salient being:
Created names might easily conflict with variables already used by your logic.
Since the names are dynamically created, you typically also end up using dynamic techniques to retrieve the data.
This is why dicts were included in the language. The correct way to proceed is:
d = {}
for name in companies:
    d[name] = pd.DataFrame()
Nowadays you can write a single dict comprehension to do the same thing, but some people find it less readable:
d = {name: pd.DataFrame() for name in companies}
Once d is created the DataFrame for company x can be retrieved as d[x], so you can look up a specific company quite easily. To operate on all companies you would typically use a loop like:
for name, df in d.items():
    # operate on DataFrame 'df' for company 'name'
In Python 2 you are better off writing
for name, df in d.iteritems():
because this avoids instantiating a list of (name, df) tuples.
You can do this (although obviously use exec with extreme caution if this is going to be public-facing code)
for c in companies:
    exec('{} = pd.DataFrame()'.format(c))
Adding to the great answers above: the code above works flawlessly if you need to create empty DataFrames, but here is how to create multiple DataFrames based on some filtering.
Suppose the list you have is a column of some larger DataFrame, and you want to make one DataFrame for each unique company in it.
First, take the unique names of the companies:
compuniquenames = df.company.unique()
Create a dictionary to store your DataFrames:
companydict = {elem : pd.DataFrame() for elem in compuniquenames}
Then fill each entry with the rows that match its key:
for key in companydict.keys():
    companydict[key] = df[df.company == key]
This gives you a DataFrame of matching records for each unique company.
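The filter-in-a-loop approach can also be written directly with groupby, which splits the frame once instead of scanning it per company; the sample frame here is assumed:

```python
import pandas as pd

# Assumed sample frame with a 'company' column, as in the answer above
df = pd.DataFrame({
    'company': ['AA', 'AAPL', 'AA', 'BA'],
    'price': [10, 20, 11, 30],
})

# One sub-frame per unique company, keyed by company name
companydict = {name: group for name, group in df.groupby('company')}
print(sorted(companydict))     # ['AA', 'AAPL', 'BA']
print(len(companydict['AA']))  # 2
```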
Below is the code for dynamically creating DataFrames in a loop:
companies = ['AA', 'AAPL', 'BA', ....., 'YHOO']
for eachCompany in companies:
    # Dynamically create DataFrames
    vars()[eachCompany] = pd.DataFrame()
For difference between vars(),locals() and globals() refer to the below link:
What's the difference between globals(), locals(), and vars()?
You can do it this way:
for xxx in yyy:
    globals()[f'dataframe_{xxx}'] = pd.DataFrame(xxx)
The following is reproducible. Let's say you have a list with the df/company names:
companies = ['AA', 'AAPL', 'BA', 'YHOO']
You probably also have data, presumably also a list (or rather a list of lists), like:
content_of_lists = [
[['a', '1'], ['b', '2']],
[['c', '3'], ['d', '4']],
[['e', '5'], ['f', '6']],
[['g', '7'], ['h', '8']]
]
In this special example the dfs should probably look very much alike, so this does not need to be very complicated:
dic = {}
for n, m in zip(companies, range(len(content_of_lists))):
    dic["df_{}".format(n)] = pd.DataFrame(content_of_lists[m]).rename(columns={0: "col_1", 1: "col_2"})
Here you would have to use dic["df_AA"] to get to the dataframe inside the dictionary.
But should you require more "distinct" naming of the DataFrames, I think you would have to use, for example, if-conditions, like:
dic = {}
for n, m in zip(companies, range(len(content_of_lists))):
    if n == 'AA':
        special_naming_1 = pd.DataFrame(content_of_lists[m]).rename(columns={0: "col_1", 1: "col_2"})
    elif n == 'AAPL':
        special_naming_2 ...
It is a little more effort, but it allows you to grab the DataFrame object in a more conventional way by just writing special_naming_1 instead of dic['df_AA'], and it gives you more control over the DataFrames' names and column names if that's important.