Related
I am trying to categorize a dataset based on the string that contains the name of the different objects of the dataset.
The dataset is composed of 3 columns, df['Name'], df['Category'] and df['Sub_Category'], the Category and Sub_Category columns are empty.
For each row I would like to check in different lists of words if the name of the object contains at least one word in one of the list. Based on this first check I would like to attribute a value to the category column. If it finds more than 1 word in 2 different lists I would like to attribute 2 values to the object in the category column.
Moreover, I would like to be able to identify which word has been checked in which list in order to attribute a value to the sub_category column.
Until now, I have been able to do it with only one list, but I am not able to identity which word has been checked and the code is very long to run.
Here is my code (where I added an example of names found in my dataset as df['Name']) :
import pandas as pd
import numpy as np
df['Name'] = ['vitrine murale vintage','commode ancienne', 'lustre antique', 'solex', 'sculpture médievale', 'jante voiture', 'lit et matelas', 'turbine moteur']
furniture_check = ['canape', 'chaise', 'buffet','table','commode','lit']
vehicle_check = ['solex','voiture','moto','scooter']
art_check = ['tableau','scuplture', 'tapisserie']
for idx, row in df.iterrows():
for c in furniture_check:
if c in row['Name']:
df.loc[idx, 'Category'] = 'Meubles'
Any help would be appreciated
Here is an approach that expands lists, merges them and re-combines them.
df = pd.DataFrame({"name":['vitrine murale vintage','commode ancienne', 'lustre antique', 'solex', 'sculpture médievale', 'jante voiture', 'lit et matelas', 'turbine moteur']})
furniture_check = ['canape', 'chaise', 'buffet','table','commode','lit']
vehicle_check = ['solex','voiture','moto','scooter']
art_check = ['tableau','scuplture', 'tapisserie']
# put categories into a dataframe
dfcat = pd.DataFrame([{"category":"furniture","values":furniture_check},
{"category":"vechile","values":vehicle_check},
{"category":"art","values":art_check}])
# turn apace delimited "name" column into a list
dfcatlist = (df.assign(name=df["name"].apply(lambda x: x.split(" ")))
# explode list so it can be used as join. reset_index() to keep a copy of index of original DF
.explode("name").reset_index()
# merge exploded names on both side
.merge(dfcat.explode("values"), left_on="name", right_on="values")
# where there are multiple categoryies, make it a list
.groupby("index", as_index=False).agg({"category":lambda s: list(s)})
# but original index back...
.set_index("index")
)
# simple join and have names and list of associated categories
df.join(dfcatlist)
name
category
0
vitrine murale vintage
nan
1
commode ancienne
['furniture']
2
lustre antique
nan
3
solex
['vechile']
4
sculpture médievale
nan
5
jante voiture
['vechile']
6
lit et matelas
['furniture']
7
turbine moteur
nan
I am a beginner in python. I have a hundred pair of CSV file. The file looks like this:
25_13oct_speed_0.csv
26_13oct_speed_0.csv
25_13oct_speed_0.1.csv
26_13oct_speed_0.1.csv
25_13oct_speed_0.2.csv
26_13oct_speed_0.2.csv
and others
I want to concatenate the pair files between 25 and 26 file. each pair of the file has a speed threshold (Speed_0, 0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.0) which is labeled on the file name. These files have the same structure data.
Mac Annotation X Y
A first 0 0
A last 0 0
B first 0 0
B last 0 0
Therefore, concatenate analyze is enough to join these two data. I use this method:
df1 = pd.read_csv('25_13oct_speed_0.csv')
df2 = pd.read_csv('26_13oct_speed_0.csv')
frames = [df1, df2]
result = pd.concat(frames)
for each pair files. but it takes time and not an elegant way. is there a good way to combine automatically the pair file and save simultaneously?
Idea is create DataFrame by list of files and add 2 new columns by Series.str.split by first _:
print (files)
['25_13oct_speed_0.csv', '26_13oct_speed_0.csv',
'25_13oct_speed_0.1.csv', '26_13oct_speed_0.1.csv',
'25_13oct_speed_0.2.csv', '26_13oct_speed_0.2.csv']
df1 = pd.DataFrame({'files': files})
df1[['g','names']] = df1['files'].str.split('_', n=1, expand=True)
print (df1)
files g names
0 25_13oct_speed_0.csv 25 13oct_speed_0.csv
1 26_13oct_speed_0.csv 26 13oct_speed_0.csv
2 25_13oct_speed_0.1.csv 25 13oct_speed_0.1.csv
3 26_13oct_speed_0.1.csv 26 13oct_speed_0.1.csv
4 25_13oct_speed_0.2.csv 25 13oct_speed_0.2.csv
5 26_13oct_speed_0.2.csv 26 13oct_speed_0.2.csv
Then loop per groups per column names, loop by groups with DataFrame.itertuples and create new DataFrame with read_csv, if necessary add new column filled by values from g, append to list, concat and last cave to new file by name from column names:
for i, g in df1.groupby('names'):
out = []
for n in g.itertuples():
df = pd.read_csv(n.files).assign(source=n.g)
out.append(df)
dfbig = pd.concat(out, ignore_index=True)
print (dfbig)
dfbig.to_csv(g['names'].iat[0])
I have a dataframe and the first column contains id. How do I sort the first column when it contains alphanumeric data, such as:
id = ["6LDFTLL9", "N9RFERBG", "6RHSDD46", "6UVSCF4H", "7SKDEZWE", "5566FT6N","6VPZ4T5P", "EHYXE34N", "6P4EF7BB", "TT56GTN2", "6YYPH399" ]
Expected result is
id = ["5566FT6N", "6LDFTLL9", "6P4EF7BB", "6RHSDD46", "6UVSCF4H", "6VPZ4T5P", "6YYPH399", "7SKDEZWE", "EHYXE34N", "N9RFERBG", "TT56GTN2" ]
You can utilize the .sort() method:
>>> id.sort()
['5566FT6N', '6LDFTLL9', '6P4EF7BB', '6RHSDD46', '6UVSCF4H', '6VPZ4T5P', '6YYPH399', '7SKDEZWE', 'EHYXE34N', 'N9RFERBG', 'TT56GTN2']
This will sort the list in place. If you don't want to change the original id list, you can utilize the sorted() method
>>> sorted(id)
['5566FT6N', '6LDFTLL9', '6P4EF7BB', '6RHSDD46', '6UVSCF4H', '6VPZ4T5P', '6YYPH399', '7SKDEZWE', 'EHYXE34N', 'N9RFERBG', 'TT56GTN2']
>>> id
['6LDFTLL9', 'N9RFERBG', '6RHSDD46', '6UVSCF4H', '7SKDEZWE', '5566FT6N', '6VPZ4T5P', 'EHYXE34N', '6P4EF7BB', 'TT56GTN2', '6YYPH399']
Notice, with this one, that id is unchanged.
For a DataFrame, you want to use sort_values().
df.sort_values(0, inplace=True)
0 is either the numerical index of your column or you can pass the column name (eg. id)
0
5 5566FT6N
0 6LDFTLL9
8 6P4EF7BB
2 6RHSDD46
3 6UVSCF4H
6 6VPZ4T5P
10 6YYPH399
4 7SKDEZWE
7 EHYXE34N
1 N9RFERBG
9 TT56GTN2
I have a few functions that make new columns in a pandas dataframe, as a function of existing columns in the dataframe. I have two different scenarios that occur here: (1) the dataframe is NOT multiIndex and has a set of columns, say [a,b] and (2) the dataframe is multiIndex and now has the same set of columns headers repeated N times, say [(a,1),(b,1),(a,2),(b,2)....(a,N),(n,N)].
I've been making the aforementioned functions in the style shown below:
def f(df):
if multiindex(df):
for s df[a].columns:
df[c,s] = someFunction(df[a,s], df[b,s])
else:
df[c] = someFunction(df[a], df[b])
Is there another way to do this, without having these if-multi-index/else statement everywhere and duplicating the someFunction code? I'd prefer NOT to split the multi indexed frame into N smaller dataframes (I often need to filter data or do things and keep the rows consistent across all the 1,2,...N frames, and keeping them together in one frame seems the to be the best way to do that).
you may still have to test if columns is a MultiIndex but this should be cleaner and more efficient. Caveat, this will not work if your function utilizes summary statistics on the column. For example, if someFunction divides by the the average of column 'a'.
Solution
def someFunction(a, b):
return a + b
def f(df):
df = df.copy()
ismi = isinstance(df.columns, pd.MultiIndex)
if ismi:
df = df.stack()
df['c'] = someFunction(df['a'], df['a'])
if ismi:
df = df.unstack()
return df
Setup
import pandas as pd
import numpy as np
setup_tuples = []
for c in ['a', 'b']:
for i in ['one', 'two', 'three']:
setup_tuples.append((c, i))
columns = pd.MultiIndex.from_tuples(setup_tuples)
rand_array = np.random.rand(10, len(setup_tuples))
df = pd.DataFrame(rand_array, columns=columns)
df looks like this
a b
one two three one two three
0 0.282834 0.490313 0.201300 0.140157 0.467710 0.352555
1 0.838527 0.707131 0.763369 0.265170 0.452397 0.968125
2 0.822786 0.785226 0.434637 0.146397 0.056220 0.003197
3 0.314795 0.414096 0.230474 0.595133 0.060608 0.900934
4 0.334733 0.118689 0.054299 0.237786 0.658538 0.057256
5 0.993753 0.552942 0.665615 0.336948 0.788817 0.320329
6 0.310809 0.199921 0.158675 0.059406 0.801491 0.134779
7 0.971043 0.183953 0.723950 0.909778 0.103679 0.695661
8 0.755384 0.728327 0.029720 0.408389 0.808295 0.677195
9 0.276158 0.978232 0.623972 0.897015 0.253178 0.093772
I constructed df to have MultiIndex columns. What I'd do is use the .stack() method to push the second level of the column index to be the second level of the row index.
df.stack() looks like this
a b
0 one 0.282834 0.140157
three 0.201300 0.352555
two 0.490313 0.467710
1 one 0.838527 0.265170
three 0.763369 0.968125
two 0.707131 0.452397
2 one 0.822786 0.146397
three 0.434637 0.003197
two 0.785226 0.056220
3 one 0.314795 0.595133
three 0.230474 0.900934
two 0.414096 0.060608
4 one 0.334733 0.237786
three 0.054299 0.057256
two 0.118689 0.658538
5 one 0.993753 0.336948
three 0.665615 0.320329
two 0.552942 0.788817
6 one 0.310809 0.059406
three 0.158675 0.134779
two 0.199921 0.801491
7 one 0.971043 0.909778
three 0.723950 0.695661
two 0.183953 0.103679
8 one 0.755384 0.408389
three 0.029720 0.677195
two 0.728327 0.808295
9 one 0.276158 0.897015
three 0.623972 0.093772
two 0.978232 0.253178
Now you can operate on df.stack() as if the columns were not a MultiIndex
Demonstration
print f(df)
will give you what you want
a b c \
one three two one three two one
0 0.282834 0.201300 0.490313 0.140157 0.352555 0.467710 0.565667
1 0.838527 0.763369 0.707131 0.265170 0.968125 0.452397 1.677055
2 0.822786 0.434637 0.785226 0.146397 0.003197 0.056220 1.645572
3 0.314795 0.230474 0.414096 0.595133 0.900934 0.060608 0.629591
4 0.334733 0.054299 0.118689 0.237786 0.057256 0.658538 0.669465
5 0.993753 0.665615 0.552942 0.336948 0.320329 0.788817 1.987507
6 0.310809 0.158675 0.199921 0.059406 0.134779 0.801491 0.621618
7 0.971043 0.723950 0.183953 0.909778 0.695661 0.103679 1.942086
8 0.755384 0.029720 0.728327 0.408389 0.677195 0.808295 1.510767
9 0.276158 0.623972 0.978232 0.897015 0.093772 0.253178 0.552317
three two
0 0.402600 0.980626
1 1.526739 1.414262
2 0.869273 1.570453
3 0.460948 0.828193
4 0.108599 0.237377
5 1.331230 1.105884
6 0.317349 0.399843
7 1.447900 0.367907
8 0.059439 1.456654
9 1.247944 1.956464
I have the following Pandas Dataframe**** in Python.
Temp_Fact Oscillops_read A B C D E F G H I J
0 A Today 0.710213 0.222015 0.814710 0.597732 0.634099 0.338913 0.452534 0.698082 0.706486 0.433162
1 B Today 0.653489 0.452543 0.618755 0.555629 0.490342 0.280299 0.026055 0.138876 0.053148 0.899734
2 A Aactl 0.129211 0.579690 0.641324 0.615772 0.927384 0.199651 0.652395 0.249467 0.262301 0.049795
3 A DFE 0.743794 0.355085 0.637794 0.633634 0.810033 0.509244 0.470418 0.972145 0.647222 0.610636
4 C Real_Mt_Olv 0.724282 0.332965 0.063078 0.004550 0.585398 0.869376 0.232148 0.630162 0.102206 0.232981
5 E Q_Mont 0.221685 0.224834 0.110734 0.397999 0.814153 0.552924 0.981098 0.536750 0.251941 0.383994
6 D DFE 0.655386 0.561297 0.305310 0.140998 0.433054 0.118187 0.479206 0.556546 0.556017 0.025070
7 F Bryo 0.257884 0.228650 0.413149 0.285651 0.814095 0.275627 0.775620 0.392448 0.827725 0.935581
8 C Aactl 0.017388 0.133848 0.939049 0.159416 0.923788 0.375638 0.331078 0.939089 0.098718 0.785569
9 C Today 0.197419 0.595253 0.574718 0.373899 0.363200 0.289378 0.698455 0.252657 0.357485 0.020484
10 C Pars 0.037771 0.683799 0.184114 0.545062 0.857000 0.295918 0.733196 0.613165 0.180642 0.254839
11 B Pars 0.637346 0.090000 0.848710 0.596883 0.027026 0.792180 0.843743 0.461608 0.552165 0.215250
12 B Pars 0.768422 0.017828 0.090141 0.108061 0.456734 0.803175 0.454479 0.501713 0.687016 0.625260
13 E Tomorrow 0.860112 0.532859 0.091641 0.768896 0.635966 0.007211 0.656367 0.053136 0.482367 0.680557
14 D DFE 0.801734 0.365921 0.243407 0.826373 0.904416 0.062448 0.801726 0.049983 0.433135 0.351150
15 F Q_Mont 0.360710 0.330745 0.598830 0.582379 0.828019 0.467044 0.287276 0.470980 0.355386 0.404299
16 D Last_Week 0.867126 0.600093 0.813257 0.005423 0.617543 0.657219 0.635255 0.314910 0.016516 0.689257
17 E Last_Week 0.551499 0.724981 0.821087 0.175279 0.301397 0.304105 0.379553 0.971244 0.558719 0.154240
18 F Bryo 0.511370 0.208831 0.260223 0.089106 0.121442 0.120513 0.099722 0.750769 0.860541 0.838855
19 E Bryo 0.323441 0.663328 0.951847 0.782042 0.909736 0.512978 0.999549 0.225423 0.789240 0.155898
20 C Tomorrow 0.267086 0.357918 0.562190 0.700404 0.961047 0.513091 0.779268 0.030190 0.460805 0.315814
21 B Tomorrow 0.951356 0.570077 0.867533 0.365708 0.791373 0.232377 0.478656 0.003857 0.805882 0.989754
22 F Today 0.963750 0.118826 0.264858 0.571066 0.761669 0.967419 0.565773 0.468971 0.466120 0.174815
23 B Last_Week 0.291186 0.126748 0.154725 0.527029 0.021485 0.224272 0.259218 0.052286 0.205569 0.617701
24 F Aactl 0.269308 0.655920 0.595518 0.404817 0.290342 0.447246 0.627082 0.306856 0.868357 0.979879
I also have a Series of values for each column:
df_base = df[df['Oscillops_read'] == 'Last_Week']
df_base_val = df_base.mean(axis=0)
As you can see, this is a Pandas Series and it is the average, for each column, over rows where Oscillops_read == 'Last_Week'. Here is the Series:
[0.56993702256121603, 0.48394061768804786, 0.59635616273775061, 0.23591030688019868, 0.31347492150330231, 0.39519847430740507, 0.42467546792253791, 0.4461465888887961, 0.26026797943899194, 0.48706569569369912]
I also have 2 lists:
1.
range_name_list = ['Base','Curnt','Prediction','Graph','Swg','Barometer_Output','Test_Cntr']
This list gives values that must be added to the dataframe df under certain conditions (described below).
2.
col_1 = list('DFA')
col_2 = list('ACEF')
col_3 = list('CEF')
col_4 = list('ABDF')
col_5 = list('DEF')
col_6 = list('AC')
col_7 = list('ABCDE')
These are lists of column names. These columns from df must be compared to the average Series above. So for example, for the 6th list col_6, columns A and C from each row of the dataframe df must be compared to columns A and C of the Series.
Problem:
As I mentioned above, I need to compare specific columns from the dataframe df to the base Series df_base_val. The columns to be compared are listed in col_1, col_2, col_3, ..., col_7. Here is what I need to do:
if a row for the dataframe column names listed in col_1 (eg. if a row for columns A and C) is greater than the base Series df_base_val in those 2 columns then for that row, in a new column Range, enter the 6th value from the list range_name_list.
Example:
eg. use col_6 - this is the 6th list and it has column names A and C.
Step 1: For row 1 of df, columns A and C are greater than
df_base_val[A] and df_base_val[C] respectively.
Step 2: Thus, for row 1, in a new column Range, enter the 6th element from the list range_name_list - the 6th element is Barometer_Output.
Example Output:
After doing this, the 1st row becomes:
0 A Today 0.710213 0.222015 0.814710 0.597732 0.634099 0.338913 0.452534 0.698082 0.706486 0.433162 'Barometer_Output'
Now, if this row were NOT to be greater than the Series in columns A and C and is not greater than the Series in columns from col_1, col_2, etc. then the Range column must be assigned the value 'Not_in_Range'. In that case, this row would become:
0 A Today 0.710213 0.222015 0.814710 0.597732 0.634099 0.338913 0.452534 0.698082 0.706486 0.433162 'Not_in_Range'
Simplification and Question:
In this example:
I only compared the 1st row to the base series. I need to compare
all rows of df individually to the base series and add an appropriate value.
I only used the 6th list of columns - this was col_6. Similarly, I need to go through each list of column names - col_1, col_2, ...., col_7.
If the row being compared is not greater than any of the lists col_1 to col_7, in the specified columns, then the column Range must be assigned the value 'Not_in_Range'.
Is there a way to do this? Maybe using loops?
**** to create the above dataframe, select it from above and copy. Then use the following code:
import pandas as pd
df = pd.read_clipboard()
print df
EDIT:
If multiple conditions are met, I would need that they all be listed. i.e. if the row belongs to 'Swg' and 'Curnt', then I would need to list both of these in the Range column, or to create separate Range columns, or just Python lists, for each matching result. Range1 would list 'Swg' and column Range2 would list 'Curnt', etc.
For starters I would create a dictionary with your condition sets where the keys can be used as indices for your range_name_list list:
conditions = {0: list('DFA'),
1: list('ACEF'),
2: list('CEF'),
3: list('ABDF'),
4: list('DEF'),
5: list('AC'),
6: list('ABCDE')}
The following code will then accomplish what I understand to be your task:
# Create your Range column to be filled in later.
df['Range'] = '|'
for index, row in df.iterrows():
for ix, list in conditions.iteritems():
# Create a list of the outcomes of checking whether the
# value for each condition column is greater than the
# df_base_val average.
truths = [row[column] > df_base_val[column] for column in list]
# See if all checks evaluated to True
if sum(truths) == len(truths):
# If so, set the 'Range' column's value for the current row
# to the appropriate 'range_name'
df.ix[index, 'Range'] = df.ix[index, 'Range'] + range_name_list[ix] + "|"
# Fill in all rows where no conditions were met with 'Not_in_Range'
df['Range'][df['Range'] == '|'] = 'Not_in_Range'
try this code:
df = pd.read_csv(BytesIO(txt), delim_whitespace=True)
df_base = df[df['Oscillops_read'] == 'Last_Week']
df_base_val = df_base.mean(axis=0)
columns = ['DFA', 'ACEF', 'CEF', 'ABDF', 'DEF', 'AC', 'ABCDE']
range_name_list = ['Base','Curnt','Prediction','Graph','Swg','Barometer_Output','Test_Cntr']
ranges = pd.Series(["NOT_IN_RANGE" for _ in range(df.shape[0])], index=df.index)
for name, cols in zip(range_name_list, columns):
cols = list(cols)
idx = df.index[(df[cols] > df_base_val[cols]).all(axis=1)]
ranges[idx] = name
print ranges
But I don't know what you want if there are multiple range match with one row.