Get column names containing a value - python
I have a .csv that looks like the sample below. What is the best way to keep the first few columns (id, account_id, date, amount, payments) intact while adding a new column that holds the name of the column marked with an 'X' in each row?
The first 10 rows of the csv look like:
id,account_id,date,amount,payments,24_A,12_B,12_A,60_D,48_C,36_D,36_C,12_C,48_A,24_C,60_C,24_B,48_D,24_D,48_B,36_A,36_B,60_B,12_D,60_A
4959,2,1994-01-05,80952,3373,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4961,19,1996-04-29,30276,2523,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4962,25,1997-12-08,30276,2523,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4967,37,1998-10-14,318480,5308,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4968,38,1998-04-19,110736,2307,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4973,67,1996-05-02,165960,6915,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4986,97,1997-08-10,102876,8573,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4988,103,1997-12-06,265320,7370,-,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4989,105,1998-12-05,352704,7348,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4990,110,1997-09-08,162576,4516,-,-,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-
There used to be a method for exactly this, DataFrame.lookup, but it has been deprecated in favor of melt + loc[].
The idea is to pass the identifier columns as id_vars; melt collapses every other column into variable/value pairs, one per row. Then filter to the rows where the value is 'X', effectively dropping the rest.
import pandas as pd

df = pd.read_csv('test.txt')
# collapse the indicator columns into (x_col, value) pairs
df = df.melt(id_vars=['id', 'account_id', 'date', 'amount', 'payments'], var_name='x_col')
# keep only the rows marked 'X', then drop the helper column
df = df.loc[df['value'] == 'X'].drop(columns='value')
print(df)
Output
id account_id date amount payments x_col
0 4959 2 1994-01-05 80952 3373 24_A
5 4973 67 1996-05-02 165960 6915 24_A
11 4961 19 1996-04-29 30276 2523 12_B
22 4962 25 1997-12-08 30276 2523 12_A
26 4986 97 1997-08-10 102876 8573 12_A
33 4967 37 1998-10-14 318480 5308 60_D
44 4968 38 1998-04-19 110736 2307 48_C
48 4989 105 1998-12-05 352704 7348 48_C
57 4988 103 1997-12-06 265320 7370 36_D
69 4990 110 1997-09-08 162576 4516 36_C
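A reshape-free alternative (not from the original answer): since each row carries exactly one 'X', a boolean mask plus idxmax(axis=1) recovers the marked column name directly. This sketch assumes clean data; a row with no 'X' would silently report the first indicator column.

import pandas as pd

df = pd.read_csv('test.txt')
id_cols = ['id', 'account_id', 'date', 'amount', 'payments']
# eq('X') builds a boolean mask; idxmax(axis=1) returns the first
# column label that is True in each row
df['x_col'] = df.drop(columns=id_cols).eq('X').idxmax(axis=1)
print(df[id_cols + ['x_col']])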
Related
How to match 2 different arrays with the same length
Good morning, I have 2 arrays with the same shape, over 21 rows. I would like to find a match between df and df1, then skip to the next 3 rows and find a match there, and so on. Then I would like to return the Ids and matching values and save them to an Excel sheet. Here's an example:

df = np.array([[[100,1,2,4,5,6,8],
                [87,1,6,20,22,23,34],
                [99,1,12,13,34,45,46]],
               [[64,1,10,14,29,32,33],
                [55,1,22,13,23,33,35],
                [66,1,6,7,8,9,10]],
               [[77,1,2,3,5,6,8],
                [811,1,2,5,6,8,10],
                [118,1,7,8,22,44,56]]])

df1 = np.array([[[89,2,4,6,8,10,12],
                 [81,2,7,25,28,33,54],
                 [91,2,13,17,33,45,48]],
                [[66,2,20,24,25,36,38],
                 [56,1,24,33,43,53,65],
                 [78,3,14,15,18,19,20]],
                [[120,4,10,23,25,26,38],
                 [59,5,22,35,36,38,40],
                 [125,1,17,18,32,44,56]]])

I would like to match:

100 to 89, 81, 91
87 to 89, 81, 91
99 to 89, 81, 91
64 to 66, 56, 78
55 to 66, 56, 78
66 to 66, 56, 78
77 to 120, 59, 125
811 to 120, 59, 125
118 to 120, 59, 125

and save the output to an Excel sheet.
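One plausible reading of the question: for each of the three blocks, intersect the values (everything after the leading Id) of every df row with every df1 row in the same block, then write the Id pairs and their common values to Excel. A minimal sketch under that assumption (the output filename is made up, and to_excel needs openpyxl installed):

import numpy as np
import pandas as pd

df = np.array([[[100,1,2,4,5,6,8], [87,1,6,20,22,23,34], [99,1,12,13,34,45,46]],
               [[64,1,10,14,29,32,33], [55,1,22,13,23,33,35], [66,1,6,7,8,9,10]],
               [[77,1,2,3,5,6,8], [811,1,2,5,6,8,10], [118,1,7,8,22,44,56]]])
df1 = np.array([[[89,2,4,6,8,10,12], [81,2,7,25,28,33,54], [91,2,13,17,33,45,48]],
                [[66,2,20,24,25,36,38], [56,1,24,33,43,53,65], [78,3,14,15,18,19,20]],
                [[120,4,10,23,25,26,38], [59,5,22,35,36,38,40], [125,1,17,18,32,44,56]]])

rows = []
# walk the corresponding 3-row blocks in lockstep; element 0 of each
# row is treated as the Id, the rest as the values to compare
for block_a, block_b in zip(df, df1):
    for row_a in block_a:
        for row_b in block_b:
            common = sorted(set(row_a[1:]) & set(row_b[1:]))
            rows.append({'df_id': int(row_a[0]),
                         'df1_id': int(row_b[0]),
                         'matching_values': ', '.join(map(str, common))})

out = pd.DataFrame(rows)
out.to_excel('matches.xlsx', index=False)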
Get the sum of absolute values of columns for a dataframe
If I have a dataframe and I want to sum the values of the columns, I could do something like:

import pandas as pd

studentdetails = {
    "studentname": ["Ram", "Sam", "Scott", "Ann", "John", "Bobo"],
    "mathantics": [80, 90, 85, 70, 95, 100],
    "science": [85, 95, 80, 90, 75, 100],
    "english": [90, 85, 80, 70, 95, 100],
}
index_labels = ['r1', 'r2', 'r3', 'r4', 'r5', 'r6']
df = pd.DataFrame(studentdetails, index=index_labels)
print(df)

df3 = df.sum()
print(df3)

col_list = ['studentname', 'mathantics', 'science']
print(df[col_list].sum())

How can I do something similar, but instead of getting only the sum, get the sum of absolute values (which in this particular case would be the same) of some columns? I tried abs in several ways but it did not work.

Edit:

   studentname  mathantics  science  english
r1         Ram          80       85       90
r2         Sam          90       95      -85
r3       Scott         -85       80       80
r4         Ann          70       90       70
r5        John          95      -75       95
r6        Bobo         100      100      100

Expected output:

mathantics    520
science       525
english       520

Edit2: col_list cannot include string-valued columns.
You need numeric columns for absolute values. Any of the following excludes the string column (np is numpy):

import numpy as np

col_list = df.columns.difference(['studentname'])
df[col_list].abs().sum()

df.set_index('studentname').abs().sum()

df.select_dtypes(np.number).abs().sum()
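As a quick check against the edited sample data from the question:

import numpy as np
import pandas as pd

studentdetails = {
    "studentname": ["Ram", "Sam", "Scott", "Ann", "John", "Bobo"],
    "mathantics": [80, 90, -85, 70, 95, 100],
    "science": [85, 95, 80, 90, -75, 100],
    "english": [90, -85, 80, 70, 95, 100],
}
df = pd.DataFrame(studentdetails, index=['r1', 'r2', 'r3', 'r4', 'r5', 'r6'])

# select_dtypes skips the string column automatically
print(df.select_dtypes(np.number).abs().sum())
# mathantics    520
# science       525
# english       520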
Extract information from an Excel file (by updating arrays) with Excel / Python
I have an Excel file with thousands of columns in the following format:

Member No.    X     Y     Z
1000         25    60   -30
            -69    38    68
             45     2    43
1001         24    55    79
              4    -7    89
             78    51    -2
1002         45   -55   149
             94    77  -985
             -2   559    56

I need a way to get a new table with the absolute maximum value from each column per member. In this example, something like:

Member No.    X    Y    Z
1000         69   60   68
1001         78   55   89
1002         94  559  985

I have tried it in Excel (using VLOOKUP to find the Member Number in the first row and then HLOOKUP to find the values in the rows thereafter), but the problem is that the HLOOKUP is not automatically updated to the new array (the one member 1001 sits in), so it always searches ONLY the first block (the rows belonging to member 1000); my solution works for member 1000 but not for 1001 and 1002. I also tried reading the file with Python, but I am not well-versed enough to make much headway: once the dataset has been read, how do I take the next 3 rows and get the (absolute) maximum in each column? Can someone please help? Solution required in Python 3 or Excel (ideally Excel 2014).
The below solution will get you your desired output using Python. First, ffill fills in the blanks in your Member No. column (filling down along the index). Then abs converts all values to their absolute magnitudes. Lastly, grouping by member and taking max gets the largest value in each column per member. Assuming your dataframe is called data:

import pandas as pd

data['Member No.'] = data['Member No.'].ffill(axis=0).astype(int)
data = data.abs()
res = data.groupby('Member No.').apply(lambda x: x.max()).drop('Member No.', axis=1).reset_index()

Which will print:

   Member No.   X    Y    Z    A    B    C
0        1000  69   60   68   60   74   69
1        1001  78   55   89   78   92   87
2        1002  94  559  985  985  971  976

Note that I added extra columns (A, B, C) to the sample data to make sure that every column returns its max() value.
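The answer assumes data is already in memory. As a sketch of the full round trip, with the workbook and output filenames as placeholders (pandas needs openpyxl to read and write .xlsx):

import pandas as pd

data = pd.read_excel('members.xlsx')  # placeholder filename
data['Member No.'] = data['Member No.'].ffill().astype(int)
# take absolute values first, then the per-member max of each column
res = data.abs().groupby('Member No.').max().reset_index()
res.to_excel('abs_max_per_member.xlsx', index=False)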
Subtract a constant from a column in a pandas dataframe
I have a dataframe as follows:

year,value
1970,2.0729729191557147
1971,1.0184197388632872
1972,2.574009084167593
1973,1.4986879160266255
1974,3.0246498975934464
1975,1.7876222478238608
1976,2.5631745148930913
1977,2.444014336917563
1978,2.619502688172043
1979,2.268273809523809
1980,2.6086169818316645
1981,0.8452720174091145
1982,1.3158922171018947
1983,-0.12695212493599603
1984,1.4374230626622169
1985,2.389290834613415
1986,2.3489311315924217
1987,2.6002265745007676
1988,1.2623717711036955
1989,1.1793426779313878

I would like to subtract a constant from each of the values in the second column. This is the code I have tried:

df = pd.read_csv(f1, sep=",", header=0)
df2 = df["value"].subtract(1)

However when I do this, df2 becomes this:

70     1.072973
71     0.018420
72     1.574009
73     0.498688
74     2.024650
75     0.787622
76     1.563175
77     1.444014
78     1.619503
79     1.268274
80     1.608617
81    -0.154728
82     0.315892
83    -1.126952
84     0.437423
85     1.389291
86     1.348931
87     1.600227
88     0.262372
89     0.179343

The year becomes only the last two digits. How can I retain all of the digits of the year?
I think the year column is not modified at all: df2 is just the value Series, printed with its row index on the left (the index labels happen to look like two-digit years). You only need to assign the subtracted values back:

df["value"] = df["value"].subtract(1)
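A quick check against the sample rows above (the file path is a placeholder): once the result is assigned back, the year column keeps all four digits.

import pandas as pd

df = pd.read_csv("data.csv")  # placeholder for the real path f1
df["value"] = df["value"].subtract(1)
print(df.head(3))
#    year     value
# 0  1970  1.072973
# 1  1971  0.018420
# 2  1972  1.574009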
Read a text file of key-value pairs and convert each line to one dictionary using python pandas
I have a text file (one.txt) that contains an arbitrary number of key-value pairs per line, where key and value are separated by = (e.g. 1=8). Here are some examples:

1=88|11=1438|15=KKK|45=00|45=00|21=66|86=a
4=13|11=1438|49=DDD|8=157.73|67=00|45=00|84=b|86=a
6=84|41=18|56=TTT|67=00|4=13|45=00|07=d

I need to create a DataFrame from a list of dictionaries, with each line as one dictionary in the list, like so:

[{1:88, 11:1438, 15:kkk, 45:7.7, ...}, {4:13, 11:1438, ...}, {6:84, 41:18, 56:TTT, ...}]

This is what I have tried:

df = pd.read_csv("input.txt", names=['text'], header=None)
data = df['text'].str.split("|")
names = [y.split('=') for x in data for y in x]
ds = pd.DataFrame(names)
print(ds)

How can I create a dictionary for each line by splitting on the = symbol? Each line should become one row with multiple columns. The DataFrame should have all the keys as columns and the values of each line filled in as a row. Example:

1   11    15   45  21  86  4   49  8   67  84  6   41  56  45  07
88  1438  kkk  00  66  a   na  1438  na  .....
I think performing a .pivot would work. Try this:

import pandas as pd

df = pd.read_csv("input.txt", names=['text'], header=None)
data = df['text'].str.split("|")
# one [key, value] pair per element, across all lines
names = [y.split('=') for x in data for y in x]
ds = pd.DataFrame(names)
ds = ds.pivot(columns=0).fillna('')

The .fillna('') removes the None values. If you'd like na instead, use .fillna('na').

Output:

ds.head()

     1
0   07    1    11   15   21    4   41   45   49   56    6   67    8   84   86
0        88
1             1438
2                  KKK
3                                  00
4                                  00

For space I didn't print the entire dataframe, but it indexes the columns by key and fills in the value from each pair (preserving the dict-by-line concept).
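If the intermediate "list of dictionaries" from the question is wanted literally, a small sketch that builds it line by line (keys and values stay strings; note the sample's repeated 45=00 means later duplicates of a key overwrite earlier ones within a line):

import pandas as pd

records = []
with open("input.txt") as fh:
    for line in fh:
        # one dict per line: split pairs on '|', then key from value on the first '='
        records.append(dict(item.split("=", 1) for item in line.strip().split("|")))

ds = pd.DataFrame(records).fillna('na')  # keys become columns, one row per line
print(ds)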