I have the below df, built from a pivot of a larger df. In this table 'week' is the index (dtype = object) and I need to show week 53 as the first row instead of the last.
Can someone advise, please? I tried reindex and custom sorting but can't find a way.
Thanks!
Here is the table:
Since you can't insert the row and push the others back directly, a clever trick you can use is to create a new order:
# add a new column, "new", with the original order
df['new'] = range(1, len(df) + 1)
# set the row with index 53 to 0 in the new column
# note that this comparison requires you to match the index type,
# so if the weeks are stored as object/str, compare df.index == '53'
df.loc[df.index == 53, 'new'] = 0
# sort by the new column and drop it
df = df.sort_values("new").drop('new', axis=1)
Before:
numbers
weeks
1 181519.23
2 18507.58
3 11342.63
4 6064.06
53 4597.90
After:
numbers
weeks
53 4597.90
1 181519.23
2 18507.58
3 11342.63
4 6064.06
One way of doing this would be:
import pandas as pd
df = pd.DataFrame(range(10))
new_df = df.loc[[df.index[-1]] + list(df.index[:-1])]
# optionally append .reset_index(drop=True) if you don't need the original labels
output:
0
9 9
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
Alternate method:
# if the week is a regular column rather than the index (here named "Year week"):
new_df = pd.concat([df[df["Year week"] == 53], df[~(df["Year week"] == 53)]])
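As a side note, since the question mentions trying reindex: passing the desired label order to reindex also works. A minimal sketch, assuming the week labels are strings in the index as described (use 53 instead of '53' if they are ints):
# build the new order with week 53 first and the remaining weeks unchanged
new_order = ['53'] + [w for w in df.index if w != '53']
df = df.reindex(new_order)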
I have a pandas DataFrame containing a time series of data.
The row at every full second contains a string with the name of the point; the next 4 rows contain random point IDs. I want to rename those 4 values to the name of the preceding 'pointname' row, with an added suffix.
time ID
12:00:00,00 pointname1
12:00:00,20 12345
12:00:00,40 45645
12:00:00,60 78963
12:00:00,80 23432
12:00:01,00 pointname2
12:00:01,20 53454
12:00:01,40 24324
12:00:01,60 24324
12:00:01,80 42435
I want to transform this into:
time ID
12:00:00,00 pointname1
12:00:00,20 pointname1_1
12:00:00,40 pointname1_2
12:00:00,60 pointname1_3
12:00:00,80 pointname1_4
12:00:01,00 pointname2
12:00:01,20 pointname2_1
12:00:01,40 pointname2_2
12:00:01,60 pointname2_3
12:00:01,80 pointname2_4
I have a working solution that iterates over the entire DataFrame, detects the 'pointname' rows and renames the 4 rows after each one. However, that takes a very long time with the 1.3 million rows the data contains. Is there a more clever and efficient way of doing this?
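For reference, such a row-by-row loop might look roughly like the sketch below (hypothetical code, assuming the ID column from the sample and a default RangeIndex); the repeated scalar access and assignment is what makes it slow:
id_pos = df.columns.get_loc('ID')
current, counter = None, 0
for i in range(len(df)):
    value = df.iat[i, id_pos]
    if isinstance(value, str) and value.startswith('pointname'):
        # remember the point name and restart the counter
        current, counter = value, 0
    else:
        # rename the id row after the last seen point name
        counter += 1
        df.iat[i, id_pos] = f'{current}_{counter}'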
Use Series.str.startswith with Series.where to set the non-matching values to missing and then forward fill them; finally, build a counter with GroupBy.cumcount and append it to all values except the first of each group:
df['ID'] = df['ID'].where(df['ID'].str.startswith('pointname')).ffill()
df['ID'] += df.groupby('ID').cumcount().astype(str).radd('_').replace('_0','')
print (df)
time ID
0 12:00:00,00 pointname1
1 12:00:00,20 pointname1_1
2 12:00:00,40 pointname1_2
3 12:00:00,60 pointname1_3
4 12:00:00,80 pointname1_4
5 12:00:01,00 pointname2
6 12:00:01,20 pointname2_1
7 12:00:01,40 pointname2_2
8 12:00:01,60 pointname2_3
9 12:00:01,80 pointname2_4
You can use to_numeric (or str.startswith if your identifier is literal; the only important point is to get True for the rows to use as reference) to identify the ID rows, then for all other rows use ffill and groupby.cumcount to build the new identifier:
# find rows with string identifier (could use other methods)
m = pd.to_numeric(df['ID'], errors='coerce').isna()
# or if "pointname" is literal
# m = df['ID'].str.startswith('pointname')
# for non matching rows, use previous value
# and add group counter
df.loc[~m, 'ID'] = (df['ID'].where(m).ffill()
+'_'
+df.groupby(m.cumsum()).cumcount().astype(str)
)
output:
time ID
0 12:00:00,00 pointname1
1 12:00:00,20 pointname1_1
2 12:00:00,40 pointname1_2
3 12:00:00,60 pointname1_3
4 12:00:00,80 pointname1_4
5 12:00:01,00 pointname2
6 12:00:01,20 pointname2_1
7 12:00:01,40 pointname2_2
8 12:00:01,60 pointname2_3
9 12:00:01,80 pointname2_4
You can group by the seconds part of the time column and transform the ID column so that every value after the first in each group becomes the group's first value with a numeric suffix.
df['ID'] = (df.groupby(pd.to_datetime(df['time']).dt.strftime('%H:%M:%S'))
['ID'].transform(lambda col: [f'{col.iloc[0]}_{s}' for s in ['']+list(range(1, len(col)))])
.str.rstrip('_'))
# or
df['ID'] = (df.groupby(pd.to_datetime(df['time']).dt.strftime('%H:%M:%S'))
['ID'].transform(lambda col: [col.iloc[0]] + [f'{col.iloc[0]}_{s}' for s in range(1, len(col))]))
print(df)
time ID
0 12:00:00,00 pointname1
1 12:00:00,20 pointname1_1
2 12:00:00,40 pointname1_2
3 12:00:00,60 pointname1_3
4 12:00:00,80 pointname1_4
5 12:00:01,00 pointname2
6 12:00:01,20 pointname2_1
7 12:00:01,40 pointname2_2
8 12:00:01,60 pointname2_3
9 12:00:01,80 pointname2_4
I would like to make a new column with the order of the numbers in a list. I get 3,1,0,4,2,5 (the indices of the lowest numbers), but I would like a new column with 2,1,4,0,3,5, so that when I look at a row I can see what position its number takes in the whole list. What am I doing wrong?
df = pd.DataFrame({'list': [4,3,6,1,5,9]})
df['order'] = df.sort_values(by='list').index
print(df)
What you're looking for is the rank:
import pandas as pd
df = pd.DataFrame({'list': [4,3,6,1,5,9]})
df['order'] = df['list'].rank().sub(1).astype(int)
Result:
list order
0 4 2
1 3 1
2 6 4
3 1 0
4 5 3
5 9 5
You can use the method parameter to control how to resolve ties.
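For example, a small sketch of the method parameter on a toy Series with a tie:
import pandas as pd

s = pd.Series([4, 3, 6, 1, 3])
print(s.rank())                 # ties share the average rank: both 3s get 2.5
print(s.rank(method='min'))     # ties share the lowest rank: both 3s get 2.0
print(s.rank(method='dense'))   # like 'min', but the next rank is not skipped
print(s.rank(method='first'))   # ties broken by order of appearance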
I came across the below line of code, which gives an error when '.index' is not present in it.
print(df.drop(df[df['Quantity'] == 0].index).rename(columns={'Weight': 'Weight (oz.)'}))
What is the purpose of '.index' while using drop in pandas?
As explained in the documentation, you can use drop with index:
A B C D
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
df.drop([0, 1])  # here 0 and 1 are the index labels of the rows
Output:
A B C D
2 8 9 10 11
In this case it will drop the first 2 rows.
With .index in your example, you find the rows where Quantity == 0 and retrieve their index (and then use it as in the documentation).
Here are the details about the .drop() method:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html
The .drop() method needs a 'labels' parameter, which is a list of index labels (when axis=0, the default case) or column labels (when axis=1).
df[df['Quantity'] == 0] returns a DataFrame of the rows where Quantity is 0, but what we need are the index labels of those rows, so .index is needed.
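A small sketch of the intermediate steps, using a hypothetical df with Quantity and Weight columns:
import pandas as pd

df = pd.DataFrame({'Quantity': [5, 0, 3, 0], 'Weight': [1.0, 2.0, 3.0, 4.0]})
matching_rows = df[df['Quantity'] == 0]   # a DataFrame of the rows to remove
labels = matching_rows.index              # just their index labels, e.g. Index([1, 3])
print(df.drop(labels))                    # drop accepts those labels directly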
I have a problem with adding columns in pandas.
I have a DataFrame of dimension n x k. In the process I will need to add columns of dimension m x 1, where m is in [1, n], but I don't know m in advance.
When I try to do it:
df['Name column'] = data
# type(data) = list
result:
AssertionError: Length of values does not match length of index
Can I add columns of different lengths?
If you use the accepted answer, you'll lose your column names, as shown in the accepted answer's example, and as described in the documentation (emphasis added):
The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information.
It looks like column names ('Name column') are meaningful to the Original Poster / Original Question.
To keep the column names, use pandas.concat, but don't ignore_index (the default value of ignore_index is False, so you can omit that argument altogether). Continue to use axis=1:
import pandas
# Note these columns have 3 rows of values:
original = pandas.DataFrame({
'Age':[10, 12, 13],
'Gender':['M','F','F']
})
# Note this column has 4 rows of values:
additional = pandas.DataFrame({
'Name': ['Nate A', 'Jessie A', 'Daniel H', 'John D']
})
new = pandas.concat([original, additional], axis=1)
# Identical:
# new = pandas.concat([original, additional], ignore_index=False, axis=1)
print(new.head())
# Age Gender Name
#0 10 M Nate A
#1 12 F Jessie A
#2 13 F Daniel H
#3 NaN NaN John D
Notice how John D does not have an Age or a Gender.
Use concat and pass axis=1 and ignore_index=True:
In [38]:
import numpy as np
df = pd.DataFrame({'a':np.arange(5)})
df1 = pd.DataFrame({'b':np.arange(4)})
print(df1)
df
b
0 0
1 1
2 2
3 3
Out[38]:
a
0 0
1 1
2 2
3 3
4 4
In [39]:
pd.concat([df,df1], ignore_index=True, axis=1)
Out[39]:
0 1
0 0 0
1 1 1
2 2 2
3 3 3
4 4 NaN
We can add lists of different sizes to a DataFrame.
Example
a = [0,1,2,3]
b = [0,1,2,3,4,5,6,7,8,9]
c = [0,1]
Find the length of all the lists:
la,lb,lc = len(a),len(b),len(c)
# now find the max
max_len = max(la,lb,lc)
Resize all lists according to the determined max length (b is already the longest in this example):
if la != max_len:
    a.extend([''] * (max_len - la))
if lb != max_len:
    b.extend([''] * (max_len - lb))
if lc != max_len:
    c.extend([''] * (max_len - lc))
Now all the lists have the same length; create the DataFrame:
pd.DataFrame({'A':a,'B':b,'C':c})
The final output is:
   A  B  C
0  0  0  0
1  1  1  1
2  2  2
3  3  3
4     4
5     5
6     6
7     7
8     8
9     9
I had the same issue: two different dataframes without a common column. I just needed to put them beside each other in a CSV file.
Merge:
In this case, "merge" does not work; even adding a temporary column to both dfs and then dropping it. Because this method makes both dfs with the same length. Hence, it repeats the rows of the shorter dataframe to match the longer dataframe's length.
Concat:
The idea of The Red Pea didn't work for me. It just appended the shorter df to the longer one (row-wise) while leaving an empty column (NaNs) above the shorter df's column.
Solution: You need to do the following:
df1 = df1.reset_index()
df2 = df2.reset_index()
df = [df1, df2]
df_final = pd.concat(df, axis=1)
df_final.to_csv(filename, index=False)
This way, you'll see your dfs beside each other (column-wise), each with its own length.
If somebody would like to replace a specific column with one of a different size instead of adding it:
Based on this answer, I use a dict as an intermediate type.
Create Pandas Dataframe with different sized columns
If the column to be inserted is not a list but already a dict, the respective line can be omitted.
import pandas as pd

def fill_column(dataframe: pd.DataFrame, values: list, column: str) -> pd.DataFrame:
    dict_from_list = dict(enumerate(values))    # map row position -> value
    dataframe_as_dict = dataframe.to_dict()     # get the DataFrame as a dict of dicts
    dataframe_as_dict[column] = dict_from_list  # assign/replace the specific column
    # from_dict(orient='index') puts the columns on the rows, so transpose back
    return pd.DataFrame.from_dict(dataframe_as_dict, orient='index').T
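A hypothetical usage example, reusing fill_column and pandas from above to replace column 'B' with a longer list:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [7, 8, 9]})
new_df = fill_column(df, [10, 20, 30, 40, 50], 'B')
print(new_df)  # 5 rows; 'A' is padded with NaN for the two extra rows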
Let's imagine you have a DataFrame df with a large number of columns, say 50, and df does not have any indexes (i.e. index_col=None). You would like to select a subset of the columns as defined by a required_columns_list, but would like to only return those rows meeting multiple criteria as defined by various boolean indexes. Is there a way to concisely generate the selection statement using a dict generator?
As an example:
df = pd.DataFrame(np.random.randn(100,50),index=None,columns=["Col" + ("%03d" % (i + 1)) for i in range(50)])
# df.columns = Index[u'Col001', u'Col002', ..., u'Col050']
required_columns_list = ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']
Now let's imagine that I define:
boolean_index_dict = {'Col001':"MyAccount", 'Col002':"Summary", 'Col005':"Total"}
I would like to select rows using a dict generator to construct the multiple boolean indices:
df.loc[GENERATOR_USING_boolean_index_dict, required_columns_list].values
The above generator boolean method would be the equivalent of:
df.loc[(df['Col001']=="MyAccount") & (df['Col002']=="Summary") & (df['Col005']=="Total"), ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']].values
Hopefully, you can see that this would be a really useful 'template' for operating on large DataFrames, since the boolean indexing can then be defined in boolean_index_dict. I would greatly appreciate it if you could let me know whether this is possible in Pandas and how to construct the GENERATOR_USING_boolean_index_dict.
Many thanks and kind regards,
Bertie
p.s. If you would like to test this out, you will need to populate some of df columns with text. The definition of df using random numbers was simply given as a starter if required for testing...
Suppose this is your df:
df = pd.DataFrame(np.random.randint(0,4,(100,50)),index=None,columns=["Col" + ("%03d" % (i + 1)) for i in range(50)])
# the first five cols and rows:
df.iloc[:5,:5]
Col001 Col002 Col003 Col004 Col005
0 2 0 2 3 1
1 0 1 0 1 3
2 0 1 1 0 3
3 3 1 0 2 1
4 1 2 3 1 0
Compared to your example, all columns are filled with ints of 0, 1, 2 or 3.
Let's define the criteria:
req = ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']
filt = {'Col001': 2, 'Col002': 2, 'Col005': 2}
So we want certain columns, for those rows where some other columns all contain the value 2.
You can then get the result with:
df.loc[df[list(filt)].apply(lambda x: x.tolist() == list(filt.values()), axis=1), req]
In my case this is the result:
Col002 Col012 Col025 Col032 Col033
43 2 2 1 3 3
98 2 1 1 1 2
Let's check the filter columns for those rows:
df[filt.keys()].iloc[[43,98]]
Col005 Col001 Col002
43 2 2 2
98 2 2 2
And some other (non-matching) rows:
df[filt.keys()].iloc[[44,99]]
Col005 Col001 Col002
44 3 0 3
99 1 0 0
I'm starting to like Pandas more and more.
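As a closing note on the 'generator' part of the question: the same mask can also be built directly from the dict with a comprehension and numpy's logical_and.reduce (a sketch, reusing df, filt and req from above):
import numpy as np

# one boolean Series per (column, value) pair, ANDed together
mask = np.logical_and.reduce([df[col].eq(val) for col, val in filt.items()])
result = df.loc[mask, req]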