Get column names for cells containing a value - python

I have a .csv that looks like the below. I was wondering what the best way would be to keep the first few cols (id, account_id, date, amount, payments) intact while creating a new column containing the column name for observations marked with an 'X'.
The first 10 rows of the csv look like:
id,account_id,date,amount,payments,24_A,12_B,12_A,60_D,48_C,36_D,36_C,12_C,48_A,24_C,60_C,24_B,48_D,24_D,48_B,36_A,36_B,60_B,12_D,60_A
4959,2,1994-01-05,80952,3373,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4961,19,1996-04-29,30276,2523,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4962,25,1997-12-08,30276,2523,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4967,37,1998-10-14,318480,5308,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4968,38,1998-04-19,110736,2307,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4973,67,1996-05-02,165960,6915,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4986,97,1997-08-10,102876,8573,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4988,103,1997-12-06,265320,7370,-,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4989,105,1998-12-05,352704,7348,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4990,110,1997-09-08,162576,4516,-,-,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-

There used to be a DataFrame.lookup method for this, but it has been deprecated in favor of melt + loc[].
The idea is to use the id_vars as the grouping; all the other columns get melted into a single column alongside their respective values. Then filter where that value is 'X', effectively dropping the other rows.
import pandas as pd

df = pd.read_csv('test.txt')
# Melt the indicator columns into (x_col, value) pairs, keeping the id columns intact
df = df.melt(id_vars=['id', 'account_id', 'date', 'amount', 'payments'], var_name='x_col')
# Keep only the rows marked 'X', then drop the helper column
df = df.loc[df['value'] == 'X'].drop(columns='value')
print(df)
Output
id account_id date amount payments x_col
0 4959 2 1994-01-05 80952 3373 24_A
5 4973 67 1996-05-02 165960 6915 24_A
11 4961 19 1996-04-29 30276 2523 12_B
22 4962 25 1997-12-08 30276 2523 12_A
26 4986 97 1997-08-10 102876 8573 12_A
33 4967 37 1998-10-14 318480 5308 60_D
44 4968 38 1998-04-19 110736 2307 48_C
48 4989 105 1998-12-05 352704 7348 48_C
57 4988 103 1997-12-06 265320 7370 36_D
69 4990 110 1997-09-08 162576 4516 36_C
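An alternative sketch that avoids reshaping, assuming each row has exactly one 'X' (an assumption the sample data satisfies):
import pandas as pd

df = pd.read_csv('test.txt')
indicator_cols = df.columns[5:]  # the 24_A ... 60_A columns
# eq('X') marks the flagged cells; idxmax(axis=1) returns the first True column name
df['x_col'] = df[indicator_cols].eq('X').idxmax(axis=1)
df = df.drop(columns=indicator_cols)
print(df)
This keeps the original row order, whereas the melt version groups the output by x_col.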

Related

How to match 2 different arrays with the same length

Good morning,
I have 2 arrays with the same length, over 21 rows. I would like to find matches between df and df1, skip to the next 3 rows, find matches there, and so on. Then I would like to return the IDs and the matching values and save them to an Excel sheet.
Here's an example below:
import numpy as np

df = np.array([[[100, 1, 2, 4, 5, 6, 8],
                [87, 1, 6, 20, 22, 23, 34],
                [99, 1, 12, 13, 34, 45, 46]],
               [[64, 1, 10, 14, 29, 32, 33],
                [55, 1, 22, 13, 23, 33, 35],
                [66, 1, 6, 7, 8, 9, 10]],
               [[77, 1, 2, 3, 5, 6, 8],
                [811, 1, 2, 5, 6, 8, 10],
                [118, 1, 7, 8, 22, 44, 56]]])
df1 = np.array([[[89, 2, 4, 6, 8, 10, 12],
                 [81, 2, 7, 25, 28, 33, 54],
                 [91, 2, 13, 17, 33, 45, 48]],
                [[66, 2, 20, 24, 25, 36, 38],
                 [56, 1, 24, 33, 43, 53, 65],
                 [78, 3, 14, 15, 18, 19, 20]],
                [[120, 4, 10, 23, 25, 26, 38],
                 [59, 5, 22, 35, 36, 38, 40],
                 [125, 1, 17, 18, 32, 44, 56]]])
I would like to find matches as follows:
100 to 89,81,91
87 to 89,81,91
99 to 89,81,91
64 to 66,56,78
55 to 66,56,78
66 to 66,56,78
77 to 120,59,125
811 to 120,59,125
118 to 120,59,125
Then output/save the result to an Excel sheet.
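No answer is shown here, but a minimal sketch of one interpretation: treat each row's first entry as an ID, compare each row of a block in df against every row of the corresponding block in df1, and collect the common values (the matches.xlsx filename and the intersect1d-based definition of a "match" are assumptions, not from the question):
import numpy as np
import pandas as pd

# df and df1 as defined above: shape (3, 3, 7), each row is [id, v1, ..., v6]
rows = []
for block, block1 in zip(df, df1):  # walk the blocks of 3 rows in step
    for row in block:
        for row1 in block1:
            common = np.intersect1d(row[1:], row1[1:])  # shared values, IDs excluded
            if common.size:
                rows.append({'id_df': row[0], 'id_df1': row1[0],
                             'matches': ', '.join(map(str, common))})

out = pd.DataFrame(rows)
out.to_excel('matches.xlsx', index=False)  # writing .xlsx requires openpyxl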

Get the sum of absolutes of columns for a dataframe

If I have a dataframe and I want to sum the values of the columns I could do something like
import pandas as pd

studentdetails = {
    "studentname": ["Ram", "Sam", "Scott", "Ann", "John", "Bobo"],
    "mathantics": [80, 90, 85, 70, 95, 100],
    "science": [85, 95, 80, 90, 75, 100],
    "english": [90, 85, 80, 70, 95, 100],
}
index_labels = ['r1', 'r2', 'r3', 'r4', 'r5', 'r6']
df = pd.DataFrame(studentdetails, index=index_labels)
print(df)

df3 = df.sum()
print(df3)

col_list = ['studentname', 'mathantics', 'science']
print(df[col_list].sum())
How can I do something similar, but instead of the plain sum get the sum of absolute values (which in this particular case would be the same) of some columns?
I tried abs in several ways but it did not work.
Edit:
studentname mathantics science english
r1 Ram 80 85 90
r2 Sam 90 95 -85
r3 Scott -85 80 80
r4 Ann 70 90 70
r5 John 95 -75 95
r6 Bobo 100 100 100
Expected output
mathantics 520
science 525
english 520
Edit2:
The col_list cannot include string value columns
You need numeric columns to take absolute values; any of these selections works:
import numpy as np

# every column except the string column
col_list = df.columns.difference(['studentname'])
df[col_list].abs().sum()

# or move the string column into the index
df.set_index('studentname').abs().sum()

# or select the numeric dtypes directly
df.select_dtypes(np.number).abs().sum()
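A quick check against the edited sample data above (a sketch; the frame is rebuilt here with the signed values from the Edit):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "studentname": ["Ram", "Sam", "Scott", "Ann", "John", "Bobo"],
    "mathantics": [80, 90, -85, 70, 95, 100],
    "science": [85, 95, 80, 90, -75, 100],
    "english": [90, -85, 80, 70, 95, 100],
}, index=['r1', 'r2', 'r3', 'r4', 'r5', 'r6'])

print(df.select_dtypes(np.number).abs().sum())
# mathantics    520
# science       525
# english       520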

Extract information from an Excel (by updating arrays) with Excel / Python

I have an Excel file with thousands of rows in the following format:
Member No.    X     Y     Z
1000          25    60    -30
              -69   38    68
              45    2     43
1001          24    55    79
              4     -7    89
              78    51    -2
1002          45    -55   149
              94    77    -985
              -2    559   56
I need a way to get a new table with the absolute maximum value from each column for each member. In this example, something like:
Member No.    X     Y     Z
1000          69    60    68
1001          78    55    89
1002          94    559   985
I have tried it in Excel (using VLOOKUP to find the "Member No." in the first row and then HLOOKUP to find the values from the rows thereafter), but the HLOOKUP array is not automatically updated to the next block (the one containing member 1001), so my solution works for member 1000 but not for 1001 and 1002; it always searches ONLY in the first block of rows (the ones for member 1000).
I also tried reading the file with Python, but I am not well-versed enough to make much headway - once the dataset has been read, how do I step through it 3 rows at a time and get the (absolute) maximum in each column?
Can someone please help? A solution in Python 3 or Excel (ideally Excel 2014) is needed.
The below solution will get you your desired output using Python.
I first ffill to fill in the blanks in your Member No. column (filling downward along the rows). Then I convert the dataframe values to positive using abs. Lastly, grouping by Member No., I take the max of each column.
Assuming your dataframe is called data:
import pandas as pd

data['Member No.'] = data['Member No.'].ffill().astype(int)
data = data.abs()
res = data.groupby('Member No.').max().reset_index()
Which will print you:
Member No. X Y Z A B C
0 1000 69 60 68 60 74 69
1 1001 78 55 89 78 92 87
2 1002 94 559 985 985 971 976
Note that I added extra columns in your sample data to make sure that all the columns will return their max() value.
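If you still need to load the Excel file into that data frame in the first place, a minimal sketch (the filename is hypothetical):
import pandas as pd

# blank Member No. cells come in as NaN, which the ffill above fills in;
# reading .xlsx files requires openpyxl
data = pd.read_excel('members.xlsx')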

Subtract a constant from a column in a pandas dataframe

I have a dataframe as follows:
year,value
1970,2.0729729191557147
1971,1.0184197388632872
1972,2.574009084167593
1973,1.4986879160266255
1974,3.0246498975934464
1975,1.7876222478238608
1976,2.5631745148930913
1977,2.444014336917563
1978,2.619502688172043
1979,2.268273809523809
1980,2.6086169818316645
1981,0.8452720174091145
1982,1.3158922171018947
1983,-0.12695212493599603
1984,1.4374230626622169
1985,2.389290834613415
1986,2.3489311315924217
1987,2.6002265745007676
1988,1.2623717711036955
1989,1.1793426779313878
I would like to subtract a constant from each of the values in the second column. This is the code I have tried:
df = pd.read_csv(f1, sep=",", header=0)
df2 = df["value"].subtract(1)
However when I do this, df2 becomes this:
70 1.072973
71 0.018420
72 1.574009
73 0.498688
74 2.024650
75 0.787622
76 1.563175
77 1.444014
78 1.619503
79 1.268274
80 1.608617
81 -0.154728
82 0.315892
83 -1.126952
84 0.437423
85 1.389291
86 1.348931
87 1.600227
88 0.262372
89 0.179343
The year becomes only the last two digits. How can I retain all of the digits of the year?
The year column itself is not modified - df2 is only the value column (a Series), and the numbers on the left of its printout are its index, not truncated years. You only need to assign the subtracted values back:
df["value"] = df["value"].subtract(1)

read a text file which has key-value pairs and convert each line into one dictionary using python pandas

I have a text file (one.txt) that contains an arbitrary number of key-value pairs per line (where the key and value are separated by a =, e.g. 1=8). Here are some examples:
1=88|11=1438|15=KKK|45=00|45=00|21=66|86=a
4=13|11=1438|49=DDD|8=157.73|67=00|45=00|84=b|86=a
6=84|41=18|56=TTT|67=00|4=13|45=00|07=d
I need to create a DataFrame with a list of dictionaries, with each row as one dictionary in the list like so:
[{1:88,11:1438,15:kkk,45:7.7....},{4:13,11:1438....},{6:84,41:18,56:TTT...}]
df = pd.read_csv("input.txt",names=['text'],header=None)
data = df['text'].str.split("|")
names=[ y.split('=') for x in data for y in x]
ds=pd.DataFrame(names)
print ds
How can I create a dictionary for each line by splitting on the = symbol?
Each line should become one row with multiple columns: the DataFrame should have all the keys as columns and the values filled in per row.
Example:
1 11 15 45 21 86 4 49 8 67 84 6 41 56 45 07
88 1438 kkk 00 66 a
na 1438 na .....
I think performing a .pivot would work. Try this:
import pandas as pd
df = pd.read_csv("input.txt",names=['text'],header=None)
data = df['text'].str.split("|")
names=[ y.split('=') for x in data for y in x]
ds=pd.DataFrame(names)
ds = ds.pivot(columns=0).fillna('')
The .fillna('') replaces the NaN values with empty strings. If you'd like na instead, you can use .fillna('na').
Output:
ds.head()
      1
0    07   1    11   15  21  4  41  45  49  56  6  67  8  84  86
0        88
1            1438
2                  KKK
3                                  00
4                                  00
For space I didn't print the entire dataframe, but it does column indexing based on the key and then values based on the values for each line (preserving the dict by line concept).
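Alternatively, if you want the literal list of dictionaries the question describes, a short sketch (note that a duplicate key on a line, like the two 45s, keeps only the last value, since dict keys are unique):
import pandas as pd

with open("input.txt") as f:
    records = [dict(pair.split("=") for pair in line.strip().split("|"))
               for line in f if line.strip()]

print(records)              # one dict per line
df = pd.DataFrame(records)  # keys become columns, one row per line
print(df)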
