While splitting data into columns, a glitch left me with some noisy data:
site code
--- ---
0 apple_123 45
1 apple_456 xy_33
2 facebook_123 24
3 google_123 NaN
4 google_123 pq_51
I need to clean the data, such that I get the following result:
site code
--- ---
0 apple_123 45
1 apple_456_xy 33
2 facebook_123 24
3 google_123 NaN
4 google_123_pq 51
I have been able to obtain the rows that need to be modified, but am unable to progress further:
import numpy as np
import pandas as pd
site = ['apple_123','apple_456','facebook_123','google_123','google_123']
code = [45,'xy_33',24,np.nan,'pq_51']
df = pd.DataFrame(list(zip(site,code)), columns=['site','code'])
df[(~df.code.astype(str).str.isdigit())&(~df.code.isna())]
Use Series.str.extract to split the non-numeric and numeric parts of code into a helper DataFrame, then process each column separately. For the first column, remove the _ with Series.str.strip, prepend _ with Series.radd, convert missing values to an empty string, and append the result to the site column. For the second column, use Series.fillna to fall back to the original code value wherever the pattern did not match:
# raw string so that \D and \d reach the regex engine unchanged
df1 = df.code.str.extract(r'(\D+)(\d+)')
df['site'] += df1[0].str.strip('_').radd('_').fillna('')
df['code'] = df1[1].fillna(df['code'])
print(df)
site code
0 apple_123 45
1 apple_456_xy 33
2 facebook_123 24
3 google_123 NaN
4 google_123_pq 51
I have the df below, built from a pivot of a larger df. In this table 'week' is the index (dtype = object), and I need to show week 53 as the first row instead of the last.
Can someone advise, please? I tried reindex and custom sorting but can't find a way.
Thanks!
here is the table
Since you can't insert the row and push the others back directly, a trick you can use is to create a new order:
# adds a new column, "new" with the original order
df['new'] = range(1, len(df) + 1)
# sets value that has index 53 with 0 on the new column
# note that this comparison requires you to match index type
# so if weeks are object, you should compare df.index == '53'
df.loc[df.index == 53, 'new'] = 0
# sorts values by the new column and drops it
df = df.sort_values("new").drop('new', axis=1)
Before:
numbers
weeks
1 181519.23
2 18507.58
3 11342.63
4 6064.06
53 4597.90
After:
numbers
weeks
53 4597.90
1 181519.23
2 18507.58
3 11342.63
4 6064.06
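The same reordering can also be sketched without a helper column, by building the new label order and passing it to DataFrame.reindex. This is only a minimal sketch: it assumes the index labels are unique strings (the question says the index dtype is object), and the sample values are copied from the Before table above. The names first and new_order are illustrative.

```python
import pandas as pd

# Sample frame modelled on the "Before" table; string week labels are an assumption
df = pd.DataFrame(
    {"numbers": [181519.23, 18507.58, 11342.63, 6064.06, 4597.90]},
    index=pd.Index(["1", "2", "3", "4", "53"], name="weeks"),
)

first = "53"  # label to move to the top
# Put the chosen label first, then every other label in its original order
new_order = [first] + [label for label in df.index if label != first]
reordered = df.reindex(new_order)
print(reordered)
```

If the weeks are stored as integers instead, first would need to be 53 rather than '53', for the same type-matching reason noted in the comments above.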
One way of doing this would be:
import pandas as pd
df = pd.DataFrame(range(10))
# move the last index label to the front and keep the original labels
new_df = df.loc[[df.index[-1]] + list(df.index[:-1])]
output:
0
9 9
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
Alternate method:
new_df = pd.concat([df[df["Year week"]==52], df[~(df["Year week"]==52)]])
I am looking to filter a dataframe to only include values that are equal to a certain value, or greater than another value.
Example dataframe:
0 1 2
0 0 1 23
1 0 2 43
2 1 3 54
3 2 3 77
From here, I want to pull all values from column 0, where column 2 is either equal to 23, or greater than 50 (so it should return 0, 1 and 2). Here is the code I have so far:
df = df[(df[2]==23) & (df[2]>50)]
This returns nothing. However, when I split these apart and run them individually (df = df[df[2]==23] and df = df[df[2]>50]), I do get results back. Does anyone have any insight into how to get this to work?
As you said, you need or (|), not and (&):
df = df[(df[2]==23) | (df[2]>50)]
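For reference, here is a self-contained sketch of that filter on the example frame from the question. The names mask and result are illustrative, and the frame is rebuilt by hand with integer column labels, which is an assumption about how the original was created.

```python
import pandas as pd

# Rebuild the example frame from the question (integer column labels 0, 1, 2)
df = pd.DataFrame([[0, 1, 23], [0, 2, 43], [1, 3, 54], [2, 3, 77]])

# Each comparison yields a boolean Series; | keeps rows where either is True.
# The parentheses are required because | binds tighter than == and >.
mask = (df[2] == 23) | (df[2] > 50)
result = df.loc[mask, 0]
print(result.tolist())  # [0, 1, 2]
```

With & instead of |, a row would have to satisfy both conditions at once, which no value can, so the result is empty.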
I want to select a subset of my DataFrame using hierarchical indexing (with pandas or any other Python library), where the selection depends on the number of rows each value has in one of the columns.
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
df = pd.read_csv(address)
trajectory frame x y
1 1 447,956 2,219
1 2 447,839 2,327
1 3 449,183 1,795
1 4 450,444 1,833
1 5 448,514 1,708
1 6 451,532 1,832
1 7 448,471 1,759
1 8 450,028 2,097
1 9 448,215 2,203
1 10 449,311 2,063
1 11 451,745 1,76
1 12 450,827 2,264
1 13 448,991 2,208
1 14 452,829 3,106
1 15 448,688 1,77
1 16 449,844 1,951
1 17 450,044 1,991
1 18 449,835 1,901
1 19 450,793 3,49
1 20 449,618 2,354
2 1 445.936 7.219
2 2 442.879 3.327
3 1 441.283 9.795
4 1 447.956 2.219
4 3 447.839 2.327
4 6 449.183 1.795
In this DataFrame there are 4 columns, named 'trajectory', 'frame', 'x' and 'y'. The number of trajectories can differ from one DataFrame to another. Each trajectory can have multiple frames between 1 and 20, either sequential from 1 to 20 or with some frames missing. Each frame has its own value in the 'x' and 'y' columns.
My aim is to create a new DataFrame that contains only those trajectories for which all 20 frames are present. Since the number of rows per trajectory changes from file to file, I would like code that works under such conditions.
df_1 = df.set_index(['trajectory','frame'], drop=False)
Here I did hierarchical indexing using 'trajectory' and 'frame', and then found that trajectories 1 and 6 have 20 frames in them. So I could manually select them using the following code.
df_1_subset = df_1[(df_1['trajectory'] == 1) | (df_1['trajectory'] == 6)]
However, I have multiple csv files where in each Dataframe, the 'trajectory' that will have 20 rows in the 'frame' column will be different, so I will have to do this manually. I think, there must be a better way, but I just can not seem to find it. I am very new to coding and I would really appreciate anybody's help. Thank you very much in advance.
If you need to filter the trajectory level for 1 or 6, use Index.get_level_values with Index.isin:
df_1_subset = df_1[df_1.index.get_level_values('trajectory').isin([1,6])]
If you need trajectory 1 and frame 6, select with DataFrame.loc and a tuple:
df_1_subset = df_1.loc[(1, 6)]
Alternative, using boolean masks on both levels:
df_1_subset = df_1.loc[(df_1.index.get_level_values('trajectory') == 1) &
                       (df_1.index.get_level_values('frame') == 6)]
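To avoid picking the complete trajectories by hand, one possible approach is to count the distinct frames per trajectory and keep only the groups that reach 20. The sketch below is not taken from the answer above. It uses a small made-up frame in place of the CSV, and n_frames, frames_per_row and complete are illustrative names.

```python
import pandas as pd

# Small stand-in for the CSV: trajectory 1 has all 20 frames, 2 and 4 do not
df = pd.DataFrame({
    "trajectory": [1] * 20 + [2, 2] + [4, 4, 4],
    "frame": list(range(1, 21)) + [1, 2] + [1, 3, 6],
})

n_frames = 20  # number of frames a complete trajectory must have

# Count distinct frames per trajectory and broadcast the count back to each row
frames_per_row = df.groupby("trajectory")["frame"].transform("nunique")
complete = df[frames_per_row == n_frames]
print(complete["trajectory"].unique())
```

Because the count is computed per file, the same lines should work for each CSV without knowing in advance which trajectories are complete. The result can then be passed to set_index(['trajectory', 'frame']) if the hierarchical index is still wanted.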
my_df = DataFrame(np.arange(1,13).reshape(4,3), columns=list('abc'))
my_df.sum(axis="rows")
The output is
a 22
b 26
c 30
# I expect it to sum by rows, thereby giving
0 6
1 15
2 24
3 33
my_df.sum(axis="columns")  # this gives the result I expect
Why does it work counterintuitively?
In a similar context, the drop method works as expected, i.e. when I write
my_df.drop(['a'], axis="columns")
# this drops column "a".
Am I missing something? Please enlighten me.
Short version
It is a naming convention: axis names the axis that gets collapsed, so summing along the columns gives a row-wise sum. You are looking for axis='columns'.
Long version
OK, that was interesting. In pandas, axis 0 is the index (rows) and axis 1 is the columns.
However looking in the docs we find that the allowed params are:
axis : {index (0), columns (1)}
'rows' is not among the documented names; pandas treats it as an alias for 'index' (0), which is also the default. This can be read as: summing along the columns returns one total per row, and summing along the index returns one total per column. What you want to use is axis=1 or axis='columns', which gives your desired output:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(1,13).reshape(4,3), columns=list('abc'))
print(df.sum(axis=1))
Returns:
0 6
1 15
2 24
3 33
dtype: int64
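The convention can be checked side by side for sum and drop. In this small sketch the variable names col_totals, row_totals and without_a are illustrative; the frame is the one from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 13).reshape(4, 3), columns=list("abc"))

# axis=0 / "index": collapse the rows, one total per column
col_totals = df.sum(axis="index")
# axis=1 / "columns": collapse the columns, one total per row
row_totals = df.sum(axis="columns")

# For drop, axis says where to look for the label "a"
without_a = df.drop(["a"], axis="columns")

print(col_totals.tolist())      # [22, 26, 30]
print(row_totals.tolist())      # [6, 15, 24, 33]
print(list(without_a.columns))  # ['b', 'c']
```

In both methods axis identifies the axis being acted on: sum collapses it, while drop looks up labels on it.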
I have a huge DataFrame, where some columns have the same names. When I try to pick a column that exists twice, (eg del df['col name'] or df2=df['col name']) I get an error. What can I do?
You can address columns by position:
>>> df = pd.DataFrame([[1,2],[3,4],[5,6]], columns=['a','a'])
>>> df
a a
0 1 2
1 3 4
2 5 6
>>> df.iloc[:,0]
0 1
1 3
2 5
Or you can rename columns, like
>>> df.columns = ['a','b']
>>> df
a b
0 1 2
1 3 4
2 5 6
This is not a good situation to be in. Best would be to create a hierarchical column labeling scheme (Pandas allows for multi-level column labeling or row index labels). Determine what it is that makes the two different columns that have the same name actually different from each other and leverage that to create a hierarchical column index.
In the meantime, if you know the positional location of the columns in the ordered list of columns (e.g. from dataframe.columns), you can use explicit positional indexing with .iloc[] to retrieve values from the column. (.ix[] did the same job in older pandas but was removed in pandas 1.0.)
You can also create copies of the columns with new names, such as:
dataframe["new_name"] = dataframe.iloc[:, column_position].values
where column_position references the positional location of the column you're trying to get (not the name).
These may not work for you if the data is too large, however. So best is to find a way to modify the construction process to get the hierarchical column index.
Another solution:
def remove_dup_columns(frame):
    keep_names = set()
    keep_icols = list()
    for icol, name in enumerate(frame.columns):
        if name not in keep_names:
            keep_names.add(name)
            keep_icols.append(icol)
    return frame.iloc[:, keep_icols]
import numpy as np
import pandas as pd
frame = pd.DataFrame(np.random.randint(0, 50, (5, 4)), columns=['A', 'A', 'B', 'B'])
print(frame)
print(remove_dup_columns(frame))
The output is
A A B B
0 18 44 13 47
1 41 19 35 28
2 49 0 30 16
3 39 29 43 41
4 26 19 48 13
A B
0 18 13
1 41 35
2 49 30
3 39 43
4 26 48
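A shorter way to get the same first-occurrence behaviour, offered here as a sketch and not as part of the answer above, is to build a boolean mask with Index.duplicated. The small frame and the names keep and deduped are illustrative.

```python
import pandas as pd

frame = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=["A", "A", "B", "B"])

# duplicated() marks every repeat of a name after its first occurrence
keep = ~frame.columns.duplicated()
deduped = frame.loc[:, keep]
print(list(deduped.columns))  # ['A', 'B']
```

Like remove_dup_columns, this keeps the first column for each name and discards the later ones, so it only fits cases where the repeated columns are not needed.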
The following function removes columns with duplicate names and keeps only one. It is not exactly what you asked for, but you can use parts of it to solve your problem. The idea is to get the positional indices of the matching columns, which you can then use to address a specific column directly. The positions are unique, while the column names aren't.
def remove_multiples(df, varname):
    """
    makes a copy of the first column of all columns with the same name,
    deletes all columns with that name and inserts the first column again
    """
    from copy import deepcopy
    dfout = deepcopy(df)
    if (varname in dfout.columns):
        tmp = dfout.iloc[:, min([i for i, x in enumerate(dfout.columns == varname) if x])]
        del dfout[varname]
        dfout[varname] = tmp
    return dfout
where
[i for i,x in enumerate(dfout.columns == varname) if x]
is the part you need