I have a dataframe with a few columns: Client ID, Prd No, and Prd Weight. I made Client ID an index column as part of transforming the data from wide to long using the wide_to_long method.
When I apply the sort_values method to the Prd No column, the ordering is weird: it comes out as 1, 10, 100, 101, 2, 20, 200, and so on. What I want is 1, 2, 3, 4...
I've tried all sorts of things, including explicitly casting Prd No to an integer type using the astype() method, but no luck.
What could I be doing wrong? Is it a setting, or the version of Pandas I am using? Help, anyone?
import pandas as pd
df = pd.read_csv("new_export.csv")
df1 = pd.wide_to_long(
    df,
    ['diameterbh', 'base_diam_1', 'length_1', 'top_diam_1',
     'base_diam_2', 'length_2', 'top_diam_2',
     'base_diam_3', 'length_3', 'top_diam_3', 'x_product'],
    i='uniqueID', j='Tree Number', sep='_')
df3 = df1[df1['diameterbh'].notnull()].fillna(value=0).sort_values(by="Tree Number")
A bit of code would have been useful, but you are on the right track: you should indeed change the index to numeric before sorting. Be sure to assign the result back to your dataframe (or use inplace=True), because by default pandas returns a copy. This works for me:
import pandas as pd
# you want to convert column a to index, and sort numerically
x = pd.DataFrame({'a': ['10', '2', '12', '1'], 'b': [100, 20, 120, 10]})
x.set_index('a', inplace=True) # Set the index (inplace to overwrite x)
x.index = x.index.astype(int) # Make sure to change the index to a numeric type
x.sort_index(inplace=True) # Again, inplace to prevent returning a copy
This makes x into a dataframe with a properly sorted index.
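Applied to your own snippet, that means casting the Tree Number suffix, which wide_to_long puts into the MultiIndex and which, depending on your pandas version, may come through as a string. A minimal sketch, assuming the df1 from your code:
df1 = df1.reset_index('Tree Number')                 # move the level back to a column
df1['Tree Number'] = df1['Tree Number'].astype(int)  # cast the string suffix
df3 = (df1[df1['diameterbh'].notnull()]
       .fillna(value=0)
       .sort_values(by='Tree Number'))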
I am trying to fill an existing dataframe in pandas by adding several rows at a time; the number of rows depends on a list comprehension, so it is variable. The initial dataframe is filled as follows:
import pandas as pd
import portion as P
columns = ['chr', 'Start', 'End', 'type']
x = pd.DataFrame(columns=columns)
RANGE = [(212, 222),(866, 888),(152, 158)]
INTERVAL= P.Interval(*[P.closed(x, y) for x, y in RANGE])
def fill_df(df, junction, chr, type):
    df['Start'] = [x.lower for x in junction]
    df['End'] = [x.upper for x in junction]
    df['chr'] = chr
    df['type'] = type
    return df
z = fill_df(x, INTERVAL, 1, 'DUP')
The idea is to keep appending rows from different intervals (so a variable number of rows) to the existing dataframe.
I have found different ways to add several rows, but none of them are easy to apply unless I write a function to convert my data into tuples or lists, which I am not sure would be efficient. I have also tried pandas append, but I was not able to make it work for a bunch of lines.
Is there any simple way to do this?
Thanks a lot!
Have you tried wrapping the list comprehension in pd.Series?
df['Start'] = pd.Series([x.lower for x in junction])
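If the goal is to append a variable number of rows per interval, another pattern is to build one small DataFrame per batch and concatenate them at the end. A minimal sketch, assuming the portion setup from the question (the second batch and the 'DEL' label are made up for illustration):
import pandas as pd
import portion as P

def rows_from_interval(junction, chr, type):
    # one small frame per batch; the row count can vary freely
    return pd.DataFrame({'chr': chr,
                         'Start': [x.lower for x in junction],
                         'End': [x.upper for x in junction],
                         'type': type})

batches = [rows_from_interval(P.closed(212, 222) | P.closed(866, 888), 1, 'DUP'),
           rows_from_interval(P.closed(152, 158), 2, 'DEL')]
df = pd.concat(batches, ignore_index=True)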
If you want to use append and add several elements at once, you can create a second DataFrame and simply append it to the first one. That looks like this:
import intvalpy as ip
import pandas as pd
inf = [1, 2, 3]
sup = [4, 5, 6]
intervals = ip.Interval(inf, sup)
add_intervals = ip.Interval([-10, -20], [10,20])
df = pd.DataFrame(data={'start': intervals.a, 'end': intervals.b})
df2 = pd.DataFrame(data={'start': add_intervals.a, 'end': add_intervals.b})
df = df.append(df2, ignore_index=True)
print(df.head(10))
The intvalpy library, specialized for classical and full interval arithmetic, is used here. To create an interval or intervals, use the Interval function, where the first argument is the left end and the second is the right end of the intervals.
The ignore_index parameter makes the appended rows continue the indexing of the first table.
In case you want to add one line, you can do it as follows:
for k in range(len(intervals)):
    df = df.append({'start': intervals[k].a, 'end': intervals[k].b}, ignore_index=True)
print(df.head(10))
I purposely did it with a loop to show that you can do without creating a second table if you only want to add a few rows.
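One caveat worth flagging: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the table-to-table variant above would be written with pd.concat instead. A sketch, reusing df and df2 from the snippet:
df = pd.concat([df, df2], ignore_index=True)
print(df.head(10))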
I have a DataFrame with some missing values, that I need to replace with other values from a different dataframe.
I can do this with apply, but it is very slow as there is a lot of data. I suspect that it is very slow, because apply loops over all the rows and has to perform the pd.isnull check in every function call.
As there are not that many NaN values, I thought the pandas where function might be a faster alternative. However, it did not work the way I expected, given how apply works.
I created a reduced example as below. (As you can see, the indices are not unique, but a group belonging to a key is relatively small in comparison with the whole dataset.):
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([['x', 'a', 10], ['x', 'b', np.nan], ['y', 'b', 20]], dtype=object), columns=['collection', 'subpart', 'freq']).set_index('collection')
df_other = pd.DataFrame(np.array([['x', 'a', 40], ['x', 'b', 30], ['x', 'c', 50]], dtype=object), columns=['collection', 'subpart', 'freq']).set_index('collection')
# This works, but is too slow:
df.freq = df.apply(lambda row: df_other.loc[row.name].pipe(lambda df: df[df.subpart == row.subpart]).freq.values[0] if pd.isnull(row.freq) else row.freq, axis=1)
# I hoped to optimize it like this, but throws error:
df.where(pd.isnull, lambda row: df_other.loc[row.name].pipe(lambda df: df[df.subpart == row.subpart]).freq.values[0], axis=1)
The last line of code here throws "AttributeError: 'DataFrame' object has no attribute 'name'". It seems the 'axis' argument here has a different meaning than in apply.
So, my question is: can I make pandas where work the way I intend, with the same result as apply? I would also accept any solution that optimizes what I am trying to do in some other way.
PS. As the two dataframes have different shapes, I cannot use something like combine_first.
This is a job for update. We just need to add subpart to the index so that it's included in the alignment. As @DanielMesejo points out, we need to specify overwrite=False so as not to change existing non-null data.
df = df.set_index('subpart', append=True)
df.update(df_other.set_index('subpart', append=True), overwrite=False)
df = df.reset_index('subpart')
This also works with .fillna, since we only need to fill missing values:
df = (df.set_index('subpart', append=True)
.fillna(df_other.set_index('subpart', append=True))
.reset_index('subpart'))
subpart freq
collection
x a 10
x b 30
y b 20
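The overwrite=False flag is what keeps the existing 10 intact: update with overwrite=False only writes to positions that are NaN in the caller. A tiny self-contained illustration of that, on toy data rather than the question's:
import numpy as np
import pandas as pd

a = pd.DataFrame({'freq': [10, np.nan]}, index=['a', 'b'])
b = pd.DataFrame({'freq': [40, 30]}, index=['a', 'b'])
a.update(b, overwrite=False)  # only the NaN at 'b' is filled; the 10 survives
print(a)                      # freq: a -> 10.0, b -> 30.0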
I have a dataframe that looks like the following.
import pandas as pd
# initialise data of lists.
data = {'Name':['Tom', 'nick', 'krish', 'jack'],
'Lists':["4,67,3,4,53,32", "7,3,44,2,5,6,9", "8,9,23", "9,36,21,32"]}
# Create DataFrame
df = pd.DataFrame(data)
I want to keep the rows where the string in 'Lists' contains any value from the pre-defined list [1, 2, 3, 4, 5].
What would be the most efficient and fastest way of doing this?
I'd like to avoid a for loop, so I'm asking those of you proficient with pandas dataframes what the best way to achieve this would be.
In the example above, this would keep only the rows for 'Tom' and 'nick'.
Many thanks!
This would work:
values = set(str(i) for i in [1, 2, 3, 4, 5]) # note the set
idx = df['Lists'].str.split(',').map(lambda x: len(values.intersection(x)) > 0)
df.loc[idx, 'Name']
0 Tom
1 nick
Name: Name, dtype: object
First convert the values to a set for faster membership tests (if you have many values), then filter rows where 'Lists' intersects the values.
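An equivalent, slightly terser spelling that also keeps the full rows rather than just the names, using set.isdisjoint (a sketch reusing df and values from above):
mask = ~df['Lists'].str.split(',').map(values.isdisjoint)
print(df[mask])  # full rows for 'Tom' and 'nick'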
import pandas as pd
df = pd.DataFrame({
'col1': [99, None, 99],
'col2': [4, 5, 6],
'col3': [7, None, None]})
col_list = ['col1', 'col2']
df[col_list].dropna(axis=1, thresh=2, inplace=True)
This returns a warning and leaves the dataframe unchanged:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
The following generates no warning but still leaves the DataFrame unchanged.
df.loc[:,col_list].dropna(axis=1, thresh=2, inplace=True)
Problem:
From among a list of columns specified by the user, remove from the dataframe those columns which have fewer than 'thresh' non-null values. Make no changes to the columns that are not in the list.
I need to use inplace=True to avoid making a copy of the dataframe, since it is huge
I cannot loop over the columns and apply dropna one column at a time, because pandas.Series.dropna does not have the 'thresh' argument.
Funnily enough, dropna does not support this functionality directly, but there is a workaround.
v = df[col_list].notna().sum().lt(2)  # columns with fewer than thresh=2 non-null values
df.drop(v.index[v], axis=1, inplace=True)
By the way,
I need to use inplace=True to avoid making a copy of the dataframe
I'm sorry to inform you that even with inplace=True, a copy is generated internally. The only difference is that the result is assigned back to the original object in place, so a new object is not returned.
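For completeness, a runnable check of the workaround on the question's data (a sketch; with thresh=3, col1 with only two non-null values is dropped, while col3 is left alone because it is not in col_list):
import pandas as pd
df = pd.DataFrame({
    'col1': [99, None, 99],
    'col2': [4, 5, 6],
    'col3': [7, None, None]})
col_list = ['col1', 'col2']
v = df[col_list].notna().sum().lt(3)  # thresh=3: fewer than 3 non-null values
df.drop(v.index[v], axis=1, inplace=True)
print(df.columns.tolist())            # ['col2', 'col3']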
I think the problem is that df[col_list] (the slicing) creates a new dataframe, so inplace=True affects that temporary copy and not the original one.
Note that the subset param of dropna is not a way out here either: subset takes labels along the other axis, so with axis=1 it expects row labels rather than column names, and
df.dropna(axis=1, thresh=2, subset=col_list, inplace=True)
raises a KeyError instead of restricting the drop to col_list.
How do I copy multiple columns from one dataframe to a new dataframe? It would also be nice to rename them at the same time.
df2['colA'] = df1['col-a']  # This works
df2['colA', 'colB'] = df1['col-a', 'col-b']  # Tried and failed
Thanks
You have to use double brackets:
df2[['colA', 'colB']] = df1[['col-a', 'col-b']]
This is tried and tested on pandas 1.3.0:
df2 = df1[['col-a', 'col-b']].copy()
If you also want to rename the columns at the same time, you can write:
df2 = pd.DataFrame(columns=['colA', 'colB'])
df2[['colA', 'colB']] = df1[['col-a', 'col-b']]
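A more direct spelling of the same idea is to select and rename in one chain, which skips the empty-frame step (a sketch):
df2 = df1[['col-a', 'col-b']].rename(columns={'col-a': 'colA', 'col-b': 'colB'})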
The following also works:
# original DataFrame
df = pd.DataFrame({'a': ['hello', 'cheerio', 'hi', 'bye'], 'b': [1, 0, 1, 0]})
# new DataFrame created from 2 original cols (new cols renamed)
df_new = pd.DataFrame(columns=['greeting', 'mode'], data=df[['a','b']].values)
If you want to apply a condition when building the new dataframe:
df_new = pd.DataFrame(columns=['farewell', 'mode'], data=df[df['b']==0][['a','b']].values)
Or if you want to use just particular rows (by index), you can use "loc":
df_new = pd.DataFrame(columns=['greetings', 'mode'], data=df.loc[2:3][['a','b']].values)
# if you need preserve row index, then add index=... as argument, like:
df_new = pd.DataFrame(columns=['farewell', 'mode'], data=df.loc[2:3][['a','b']].values,
index=df.loc[2:3].index )
As of pandas 1.2.4, the easiest way would be:
df2[['colA', 'colB']] = df1[['col-a', 'col-b']].values
Note that this doesn't require .copy(): .values first materializes the data as a NumPy array, and the assignment then copies those values into df2, so df2 gets its own data rather than a view into df1. The effect is the same as appending .copy().