Normally I create a DataFrame with the code below:
df = pd.DataFrame({'a': [1], 'b': [2]})
df
Output:
a b
0 1 2
But when I create a DataFrame where one of the column names is 'start', the column order changes:
df1 = pd.DataFrame({'start':[2],'end':[4]})
df1
Output:
end start
0 4 2
I'm trying to understand why this order is getting changed.
If you don't specify the column order explicitly with columns=['', ''], pandas may fall back to alphabetical order. As a result 'end' (e) comes first and 'start' (s) comes second.
This is because dictionaries are inherently unordered, and I wouldn't be surprised if it ordered alphabetically in this case.
As @GiovaniSalazar said:
df1 = pd.DataFrame({'start':[2],'end':[4]}, columns=['start','end'])
or, equivalently:
pd.DataFrame(data = [[2, 4]], columns=['start','end'])
Either form forces the order by using an explicit, ordered data structure.
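As a quick check, a minimal sketch (assuming pandas is imported as pd) showing that the explicit columns argument pins the order:
import pandas as pd

df1 = pd.DataFrame({'start': [2], 'end': [4]}, columns=['start', 'end'])
print(df1)
# Expected output:
#    start  end
# 0      2    4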
Related
I have a dataframe containing only duplicate "MainID" rows. One MainID may have multiple secondary IDs (SecID). I want to concatenate the values of SecID, joined by ':', whenever rows share a common MainID. What is the best way of achieving this? Yes, I know this is not best practice, but it's the structure the software wants.
I need to keep the structure and values in the rest of the df; they will always match the other duplicated row. Only SecID will be different.
Current:
data={'MainID':['NHFPL0580','NHFPL0580','NHFPL0582','NHFPL0582'],'SecID':['G12345','G67890','G11223','G34455'], 'Other':['A','A','B','B']}
df=pd.DataFrame(data)
print(df)
MainID SecID Other
0 NHFPL0580 G12345 A
1 NHFPL0580 G67890 A
2 NHFPL0582 G11223 B
3 NHFPL0582 G34455 B
Intended Structure
MainID SecID Other
NHFPL0580 G12345:G67890 A
NHFPL0582 G11223:G34455 B
Try:
df.groupby('MainID').apply(lambda x: ':'.join(x.SecID))
The above code returns a pd.Series, and you can convert it to a DataFrame as @Guy suggested:
You need .reset_index(name='SecID') if you want it back as a DataFrame.
The solution to the edited question:
df = df.groupby(['MainID', 'Other']).apply(lambda x: ':'.join(x.SecID)).reset_index(name='SecID')
You can then change the column order
cols = df.columns.tolist()
df = df[[cols[i] for i in [0, 2, 1]]]
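Putting the snippets above together, a minimal end-to-end sketch on the question's sample data (same names as in the question):
import pandas as pd

data = {'MainID': ['NHFPL0580', 'NHFPL0580', 'NHFPL0582', 'NHFPL0582'],
        'SecID': ['G12345', 'G67890', 'G11223', 'G34455'],
        'Other': ['A', 'A', 'B', 'B']}
df = pd.DataFrame(data)

# Group on the columns that must stay constant and join the SecID values with ':'
df = df.groupby(['MainID', 'Other']).apply(lambda x: ':'.join(x.SecID)).reset_index(name='SecID')

# Restore the original column order: MainID, SecID, Other
df = df[['MainID', 'SecID', 'Other']]
print(df)
# Expected result (one row per MainID):
#      MainID          SecID Other
# 0 NHFPL0580  G12345:G67890     A
# 1 NHFPL0582  G11223:G34455     B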
Wondering what the best way to tackle this issue is. If I have a DataFrame df1 with the following columns
type_of_fruit name_of_fruit price
..... ..... .....
and a list called
expected_cols = ['name_of_fruit','price']
what's the best way to automate the check of df1 against the expected_cols list? I was trying something like
df_cols=df1.columns.values.tolist()
if df_cols != expected_cols:
And then try to drop to another df any columns not in expected_cols, but this doesn't seem like a great idea to me. Is there a way to save the "dropped" columns?
df2 = df1.drop(columns=expected_cols)
But then this seems problematic depending on column ordering, and also in cases where the columns could have either more values than expected, or fewer values than expected. In cases where there are fewer values than expected (i.e. df1 only contains the column name_of_fruit) I'm planning on using
df1.reindex(columns=expected_cols)
But I'm a bit iffy on how to do this programmatically, and then how to handle the case where there are more columns than expected.
You can use set difference with -:
Assuming df1 has these columns:
In [539]: df1_cols = df1.columns  # ['type_of_fruit', 'name_of_fruit', 'price']
In [540]: expected_cols = ['name_of_fruit', 'price']
In [541]: unwanted_cols = list(set(df1_cols) - set(expected_cols))
In [542]: df2 = df1[unwanted_cols]
In [543]: df1.drop(unwanted_cols, axis=1, inplace=True)
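As a quick sanity check, a sketch of the same idea on hypothetical one-row data (column names taken from the question):
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3]], columns=['type_of_fruit', 'name_of_fruit', 'price'])
expected_cols = ['name_of_fruit', 'price']

unwanted_cols = list(set(df1.columns) - set(expected_cols))
df2 = df1[unwanted_cols]                 # the "dropped" columns are preserved here
df1 = df1.drop(columns=unwanted_cols)

print(df1.columns.tolist())  # ['name_of_fruit', 'price']
print(df2.columns.tolist())  # ['type_of_fruit']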
Use groupby along the columns axis to split the DataFrame succinctly. In this case, check whether the columns are in your list to form the grouper, and you can store the results in a dict where the True key gets the DataFrame with the subset of columns in the list and the False key has the subset of columns not in the list.
Sample Data
import pandas as pd
df = pd.DataFrame(data=[[1, 2, 3]],
                  columns=['type_of_fruit', 'name_of_fruit', 'price'])
expected_cols = ['name_of_fruit','price']
Code
d = dict(tuple(df.groupby(df.columns.isin(expected_cols), axis=1)))
# If you need to ensure columns are always there then do
#d[True] = d[True].reindex(expected_cols)
d[True]
# name_of_fruit price
#0 2 3
d[False]
# type_of_fruit
#0 1
I have created this dataframe
d = {'col1': [1], 'col2': [3]}
df = pd.DataFrame(data=d)
print(df)
I have then created a field called "columnA" which is supposed to be an array made of the two elements contained in col1 and col2:
filter_col = [col for col in df if col.startswith('col')]
df["columnA"] = df[filter_col].values.tolist()
print(df)
Now, I was expecting the columnA to be a list (or an array), but when I check the length of that field I get 1 (not 2, as I expected):
print("Lenght: ",str(len(df['columnA'])))
Length: 1
What do I need to do to get a value of 2 and therefore be able to iterate through that array?
For example, I would be able to do this iteration:
for i in range(len(df['columnA'])):
    print(i)
Result:
0
1
Can anyone help me, please?
You are on the right track. Instead of calling len() directly on the column, take the cell's value first and then apply len():
print("Length: ", len(df['columnA'].values[0]))
for item in df["columnA"]:
    for num in item:
        print(num)
This will iterate directly over the column
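To make the distinction concrete, a small sketch (using the same df as above): len() on the column counts rows, while len() on the cell's value counts the elements of the stored list:
print(len(df['columnA']))          # 1 -> number of rows in the column
print(len(df['columnA'].iloc[0]))  # 2 -> number of elements in the list stored in row 0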
When calling a function using groupby + apply, I want to go from a DataFrame to a Series groupby object, apply a function to each group that takes a Series as input and returns a Series as output, and then assign the output from the groupby + apply call as a field in the DataFrame.
The default behavior is to have the output from groupby + apply indexed by the grouping fields, which prevents me from assigning it back to the DataFrame cleanly. I'd prefer to have the function I call with apply take a Series as input and return a Series as output; I think it's a bit cleaner than DataFrame to DataFrame. (This isn't the best way of getting to the result for this example; the real application is pretty different.)
import pandas as pd
df = pd.DataFrame({
    'A': [999, 999, 111, 111],
    'B': [1, 2, 3, 4],
    'C': [1, 3, 1, 3]
})
def less_than_two(series):
    # Intended for series of length 1 in this case
    # But not intended for many-to-one generally
    return series.iloc[0] < 2
output = df.groupby(['A', 'B'])['C'].apply(less_than_two)
I want the index on output to be the same as on df, otherwise I can't assign it
to df (cleanly):
df['Less_Than_Two'] = output
Something like output.index = df.index seems too ugly, and using the group_keys argument doesn't seem to work:
output = df.groupby(['A', 'B'], group_keys = False)['C'].apply(less_than_two)
df['Less_Than_Two'] = output
transform returns the results with the original index, just as you've asked for. It will broadcast the same result across all elements of a group. One caveat: the dtype may be inferred to be something else, so you may have to cast it yourself.
In this case, in order to add another column, I'd use assign
df.assign(
    Less_Than_Two=df.groupby(['A', 'B'])['C'].transform(less_than_two).astype(bool))
A B C Less_Than_Two
0 999 1 1 True
1 999 2 3 False
2 111 3 1 True
3 111 4 3 False
Assuming your groupby is necessary (and the resulting groupby object will have fewer rows than your DataFrame -- this isn't the case with the example data), then assigning the Series to the 'Is.Even' column will result in NaN values (since the index to output will be shorter than the index to df).
Instead, based on the example data, the simplest approach will be to merge output -- as a DataFrame -- with df, like so:
output = df.groupby(['A','B'])['C'].agg(is_even).reset_index() # reset_index restores 'A' and 'B' from indices to columns
output.columns = ['A','B','Is_Even'] #rename target column prior to merging
df.merge(output, how='left', on=['A','B']) # this will support a many-to-one relationship between combinations of 'A' & 'B' and 'Is_Even'
# and will thus properly map aggregated values to unaggregated values
Also, I should note that you're better off using underscores than dots in variable names; unlike in R, for instance, dots act as operators for accessing object properties, and so using them in variable names can block functionality/create confusion.
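For completeness, a runnable sketch of this merge approach; the is_even helper and the Is_Even column name are this answer's placeholders (the question's function was less_than_two), defined here only for illustration:
import pandas as pd

df = pd.DataFrame({'A': [999, 999, 111, 111],
                   'B': [1, 2, 3, 4],
                   'C': [1, 3, 1, 3]})

def is_even(series):
    # placeholder aggregation: True when the first value in the group is even
    return series.iloc[0] % 2 == 0

output = df.groupby(['A', 'B'])['C'].agg(is_even).reset_index()
output.columns = ['A', 'B', 'Is_Even']
df = df.merge(output, how='left', on=['A', 'B'])
print(df)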
I have a dataframe in which the third column is a list:
import pandas as pd
pd.DataFrame([[1,2,['a','b','c']]])
I would like to separate that nest and create more rows with identical values in the first and second columns.
The end result should be something like:
pd.DataFrame([[1, 2, 'a'], [1, 2, 'b'], [1, 2, 'c']])
Note, this is simplified example. In reality I have multiple rows that I would like to "expand".
Regarding my progress, I have no idea how to solve this. I imagine I could take each member of the nested list while keeping the other column values in mind, then use a list comprehension to build more lists, and keep adding lists to create a new dataframe... but this seems a bit too complex. Is there a simpler solution?
Create the dataframe with a single column, then add columns with constant values:
import pandas as pd
df = pd.DataFrame({"data": ['a', 'b', 'c']})
df['col1'] = 1
df['col2'] = 2
print(df)
This prints:
data col1 col2
0 a 1 2
1 b 1 2
2 c 1 2
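If the lists really start out nested in a column, as in the original question, a different approach is DataFrame.explode (available in pandas >= 0.25), which expands each list element into its own row; a minimal sketch with the default integer column labels:
import pandas as pd

df = pd.DataFrame([[1, 2, ['a', 'b', 'c']]])
df = df.explode(2).reset_index(drop=True)  # column 2 holds the lists
print(df)
#    0  1  2
# 0  1  2  a
# 1  1  2  b
# 2  1  2  c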
Not exactly the same issue that the OP described, but related - and more pandas-like - is the situation where you have a dict of lists with lists of unequal lengths. In that case, you can create a DataFrame like this in long format.
import pandas as pd
my_dict = {'a': [1,2,3,4], 'b': [2,3]}
df = pd.DataFrame.from_dict(my_dict, orient='index')
df = df.unstack() # to format it in long form
df = df.dropna() # to drop nan values which were generated by having lists of unequal length
df.index = df.index.droplevel(level=0) # drop the outer level (the original column positions) if you don't need it
# NOTE: this last step results in duplicate index values
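For reference, printing df after the snippet above (a sketch, using the same my_dict as defined) shows the long-form Series with duplicate index labels; the values become floats because of the NaN padding:
print(df)
# a    1.0
# b    2.0
# a    2.0
# b    3.0
# a    3.0
# a    4.0
# dtype: float64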