How to match two different arrays with the same length - Python
Good Morning,
I have 2 arrays of the same length (over 21 rows). I would like to find a match between df and df1, then skip to the next 3 rows and find a match there, and so on. Then I would like to return the IDs and the matching values and save them to an Excel sheet.
Here's an example below:
df = np.array([[[100, 1, 2, 4, 5, 6, 8],
                [87, 1, 6, 20, 22, 23, 34],
                [99, 1, 12, 13, 34, 45, 46]],
               [[64, 1, 10, 14, 29, 32, 33],
                [55, 1, 22, 13, 23, 33, 35],
                [66, 1, 6, 7, 8, 9, 10]],
               [[77, 1, 2, 3, 5, 6, 8],
                [811, 1, 2, 5, 6, 8, 10],
                [118, 1, 7, 8, 22, 44, 56]]])
df1 = np.array([[[89, 2, 4, 6, 8, 10, 12],
                 [81, 2, 7, 25, 28, 33, 54],
                 [91, 2, 13, 17, 33, 45, 48]],
                [[66, 2, 20, 24, 25, 36, 38],
                 [56, 1, 24, 33, 43, 53, 65],
                 [78, 3, 14, 15, 18, 19, 20]],
                [[120, 4, 10, 23, 25, 26, 38],
                 [59, 5, 22, 35, 36, 38, 40],
                 [125, 1, 17, 18, 32, 44, 56]]])
I would like to find matches as follows:
100 to 89,81,91
87 to 89,81,91
99 to 89,81,91
64 to 66,56,78
55 to 66,56,78
66 to 66,56,78
77 to 120,59,125
811 to 120,59,125
118 to 120,59,125
Then output/save to an Excel sheet.
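The question itself doesn't include an attempt, but one way to do this can be sketched, assuming "match" means the values that a row of df shares with a row of the corresponding 3-row block of df1 (treating the first entry of each row as the ID):

```python
import numpy as np
import pandas as pd

df = np.array([[[100,1,2,4,5,6,8],[87,1,6,20,22,23,34],[99,1,12,13,34,45,46]],
               [[64,1,10,14,29,32,33],[55,1,22,13,23,33,35],[66,1,6,7,8,9,10]],
               [[77,1,2,3,5,6,8],[811,1,2,5,6,8,10],[118,1,7,8,22,44,56]]])
df1 = np.array([[[89,2,4,6,8,10,12],[81,2,7,25,28,33,54],[91,2,13,17,33,45,48]],
                [[66,2,20,24,25,36,38],[56,1,24,33,43,53,65],[78,3,14,15,18,19,20]],
                [[120,4,10,23,25,26,38],[59,5,22,35,36,38,40],[125,1,17,18,32,44,56]]])

rows = []
for block, block1 in zip(df, df1):            # compare the 3-row blocks pairwise
    for r in block:
        for r1 in block1:
            # first entry of each row is the ID, so intersect the rest
            common = sorted(set(r[1:]) & set(r1[1:]))
            rows.append({"id": r[0], "id1": r1[0],
                         "matches": ", ".join(map(str, common))})

out = pd.DataFrame(rows)
print(out)
# out.to_excel("matches.xlsx", index=False)   # requires openpyxl to be installed
```

For the first pair, 100 vs 89, this reports the shared values 2, 4, 6, 8; uncommenting the last line writes the whole table to an Excel sheet.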
Related
Get the column name for rows containing a value
I have a .csv that looks like the below. I was wondering what the best way would be to keep the first few cols (id, account_id, date, amount, payments) intact while creating a new column containing the column name for observations with an 'X' marked. The first 10 rows of the csv look like:

id,account_id,date,amount,payments,24_A,12_B,12_A,60_D,48_C,36_D,36_C,12_C,48_A,24_C,60_C,24_B,48_D,24_D,48_B,36_A,36_B,60_B,12_D,60_A
4959,2,1994-01-05,80952,3373,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4961,19,1996-04-29,30276,2523,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4962,25,1997-12-08,30276,2523,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4967,37,1998-10-14,318480,5308,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4968,38,1998-04-19,110736,2307,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4973,67,1996-05-02,165960,6915,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4986,97,1997-08-10,102876,8573,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4988,103,1997-12-06,265320,7370,-,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4989,105,1998-12-05,352704,7348,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4990,110,1997-09-08,162576,4516,-,-,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-
There used to be something called lookup, but that's been deprecated in favor of melt + loc[]. The idea is to use the id_vars as the grouping, and all the other columns get smashed into a single column with their respective value. Then filter where that value is X, effectively dropping the other rows.

import pandas as pd

df = pd.read_csv('test.txt')
df = df.melt(id_vars=['id','account_id','date','amount','payments'], var_name='x_col')
df = df.loc[df['value']=='X'].drop(columns='value')
print(df)

Output:

      id  account_id        date  amount  payments x_col
0   4959           2  1994-01-05   80952      3373  24_A
5   4973          67  1996-05-02  165960      6915  24_A
11  4961          19  1996-04-29   30276      2523  12_B
22  4962          25  1997-12-08   30276      2523  12_A
26  4986          97  1997-08-10  102876      8573  12_A
33  4967          37  1998-10-14  318480      5308  60_D
44  4968          38  1998-04-19  110736      2307  48_C
48  4989         105  1998-12-05  352704      7348  48_C
57  4988         103  1997-12-06  265320      7370  36_D
69  4990         110  1997-09-08  162576      4516  36_C
Extract information from an Excel file (by updating arrays) with Excel / Python
I have an Excel file with thousands of columns in the following format:

Member No.    X     Y     Z
1000         25    60   -30
            -69    38    68
             45     2    43
1001         24    55    79
              4    -7    89
             78    51    -2
1002         45   -55   149
             94    77  -985
             -2   559    56

I need a way to get a new table with the absolute maximum value from each column. In this example, something like:

Member No.    X     Y     Z
1000         69    60    68
1001         78    55    89
1002         94   559   985

I have tried it in Excel (using VLOOKUP to find the "Member Number" in the first row and then HLOOKUP to find the values in the rows thereafter), but the problem is that the HLOOKUP command is not automatically updated with the new array (the array in which member number 1001 is), so my solution works for member 1000 but not for 1001 and 1002: it always searches for the new value ONLY in the first row (i.e. the row with member number 1000). I also tried reading the file with Python, but I am not well-versed enough to make much headway - once the dataset has been read, how do I tell it to read the next 3 rows and get the (absolute) maximum in each column? Can someone please help? Solution required in Python 3 or Excel (ideally Excel 2014).
The below solution will get you your desired output using Python. I first ffill to fill in the blanks in your Member No. column (filling each blank with the value above it). Then I convert the dataframe values to positive using abs. Lastly, grouping on Member No. and taking max returns the largest value in each column per member. Assuming your dataframe is called data:

import pandas as pd

data['Member No.'] = data['Member No.'].ffill().astype(int)
value_cols = data.columns.drop('Member No.')
data[value_cols] = data[value_cols].abs()
res = data.groupby('Member No.').max().reset_index()
print(res)

Which will print:

   Member No.   X    Y    Z    A    B    C
0        1000  69   60   68   60   74   69
1        1001  78   55   89   78   92   87
2        1002  94  559  985  985  971  976

Note that I added extra columns (A, B, C) to the sample data to make sure that all the columns return their max() value.
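The answer assumes the file is already loaded. A self-contained run of the same idea on just the X, Y, Z sample from the question (with the blank Member No. cells represented as None) might look like:

```python
import pandas as pd

# Sample from the question: Member No. is blank except on each member's first row
data = pd.DataFrame({
    "Member No.": [1000, None, None, 1001, None, None, 1002, None, None],
    "X": [25, -69, 45, 24, 4, 78, 45, 94, -2],
    "Y": [60, 38, 2, 55, -7, 51, -55, 77, 559],
    "Z": [-30, 68, 43, 79, 89, -2, 149, -985, 56],
})

# fill the blanks downward, then take the absolute max per member
data["Member No."] = data["Member No."].ffill().astype(int)
res = data.set_index("Member No.").abs().groupby(level=0).max().reset_index()
print(res)
```

This reproduces the expected table from the question: 69/60/68 for member 1000, 78/55/89 for 1001, and 94/559/985 for 1002.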
Question on importing multiple csv files in Python
I have a question on importing multiple csv files such that they are vertically stacked into a column array. Here is a sample; all the files look the same:

yyyymm  count_neg  count_pos  count_all  score
200301  114        67         7470       0.006291834
200303  106        51         3643       0.015097447
200305  102        62         3925       0.010191083
200306  129        71         4964       0.011684126
200308  53         50         3819       0.000785546
200309  59         58         3926       0.000254712
200310  50         63         3734       -0.003481521
200312  75         55         4256       0.004699248

This particular set of data is from an Excel sheet called 2003.csv. I also have similar file names for the years 2004, 2005, and 2006. So again, I am wondering how to get these into Python such that I vertically stack these csvs into a column array. Right now, all I know how to do is this:

yr2003 = pandas.read_csv('2003.csv', header=0, parse_dates=True)

While df = pd.concat([yr2003, yr2004, yr2005]) does indeed combine things, I am only looking to combine the yyyymm, count_all, and score columns.
This should work:

df = pd.concat([yr2003, yr2004, yr2005])
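Since the question only wants the yyyymm, count_all, and score columns, the usecols argument of read_csv can restrict each file before concatenating. A sketch, where the in-memory StringIO buffers stand in for the real 2003.csv etc. files (the 2004 row is made up for illustration):

```python
import io
import pandas as pd

# Stand-ins for the real files; in practice: pd.read_csv("2003.csv", usecols=cols)
csv_2003 = io.StringIO("yyyymm,count_neg,count_pos,count_all,score\n"
                       "200301,114,67,7470,0.006291834\n")
csv_2004 = io.StringIO("yyyymm,count_neg,count_pos,count_all,score\n"
                       "200401,98,60,5100,0.002\n")

cols = ["yyyymm", "count_all", "score"]
frames = [pd.read_csv(f, usecols=cols) for f in (csv_2003, csv_2004)]
df = pd.concat(frames, ignore_index=True)   # vertical stack, fresh row index
print(df)
```

Selecting the columns after concatenation (df[cols]) works just as well; usecols simply avoids reading the unused columns in the first place.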
Read a text file which has key-value pairs and convert each line into one dictionary using Python pandas
I have a text file (one.txt) that contains an arbitrary number of key-value pairs (where the key and value are separated by a =, e.g. 1=88). Here are some examples:

1=88|11=1438|15=KKK|45=00|45=00|21=66|86=a
4=13|11=1438|49=DDD|8=157.73|67=00|45=00|84=b|86=a
6=84|41=18|56=TTT|67=00|4=13|45=00|07=d

I need to create a DataFrame with a list of dictionaries, with each row as one dictionary in the list, like so:

[{1:88,11:1438,15:kkk,45:7.7....},{4:13,11:1438....},{6:84,41:18,56:TTT...}]

df = pd.read_csv("input.txt", names=['text'], header=None)
data = df['text'].str.split("|")
names = [y.split('=') for x in data for y in x]
ds = pd.DataFrame(names)
print(ds)

How can I create a dictionary for each line by splitting on the = symbol? Each line should become one row with multiple columns: the DataFrame should have all the keys as columns and the values as rows. Example:

1   11    15   45  21  86  4   49  8   67  84  6   41  56  45  07
88  1438  kkk  00  66  a   na  1438  na .....
I think performing a .pivot would work. Try this:

import pandas as pd

df = pd.read_csv("input.txt", names=['text'], header=None)
data = df['text'].str.split("|")
names = [y.split('=') for x in data for y in x]
ds = pd.DataFrame(names)
ds = ds.pivot(columns=0).fillna('')

The .fillna('') removes the None values. If you'd like to replace them with na you can use .fillna('na'). Output:

ds.head()

     1
0   07   1    11   15  21   4  41  45  49  56   6  67   8  84  86
0        88
1           1438
2                 KKK
3                              00
4                              00

For space I didn't print the entire dataframe, but it does column indexing based on the key and then values based on the values for each line (preserving the dict-by-line concept).
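Since the question literally asks for one dictionary per line, an alternative (not from the answer above) is to build a list of dicts directly and hand it to the DataFrame constructor; this yields one row per line rather than one row per pair. Note that dict() keeps only the last value for a repeated key such as 45:

```python
import pandas as pd

# the three sample lines from the question, inlined instead of read from one.txt
text = """1=88|11=1438|15=KKK|45=00|45=00|21=66|86=a
4=13|11=1438|49=DDD|8=157.73|67=00|45=00|84=b|86=a
6=84|41=18|56=TTT|67=00|4=13|45=00|07=d"""

# one dict per line: split on | into pairs, then on = into key/value
records = [dict(pair.split("=") for pair in line.split("|"))
           for line in text.splitlines()]
ds = pd.DataFrame(records).fillna("na")
print(ds)
```

This gives 3 rows (one per line) with a column for every key seen in any line, and "na" where a line lacks that key, matching the example table in the question.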
How to separate an extracted column from Excel into different lists or groups in Python
I am reading a column from an Excel file using openpyxl. I have written code to get the column of data I need, but the data is separated by empty cells. I want to group these data, wherever the cell value is not None, into 19 sets of countries so that I can use them later to calculate the mean and standard deviation for the 19 countries. I don't want to hard-code it using list slices. Instead I want to save these integers to a list or list of lists using a loop, but I'm not sure how, because this is my first project with Python. Here's my code:

#Read PCT rankings project ratified results
#Beta
import openpyxl

wb = openpyxl.load_workbook('PCT rankings project ratified results.xlsx', data_only=True)
sheet = wb.get_sheet_by_name('PCT by IP firms')
row_counter = sheet.max_row
column_counter = sheet.max_column
print(row_counter)
print(column_counter)

#iterating over the column of patent filings, trying to use empty cells to flag the loop
#so it appends/stores the list of numbers before reaching the next non-empty cell,
#repeating every time this happens (expecting 19 times)
list = []
for row in range(4, sheet.max_row + 1):
    patent = sheet['I' + str(row)].value
    print(patent)
    if patent == None:
        list.append(patent)
        print(list)

This is the output from Python, giving you a visualisation of what I am trying to do. Column I:

412
14
493
488
339
273
238
226
200
194
153
164
151
126
None
120
None
None
133
77
62
79
24
0
30
20
16
0
6
9
11
None
None
None
None
608
529
435
320
266
264
200
272
134
113
73
23
12
52
21
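The question has no answer in this dump, but the grouping loop it describes can be sketched: accumulate values into the current group and close the group whenever a None separator appears. Here a hard-coded slice of the printed Column I stands in for the openpyxl cell values:

```python
import statistics

# a slice of the printed Column I; in the real script these would come from
# sheet['I' + str(row)].value
values = [412, 14, 493, None, 120, None, None, 133, 77, 62]

groups, current = [], []
for v in values:
    if v is None:              # empty cell: close the current group, if any
        if current:
            groups.append(current)
            current = []
    else:
        current.append(v)
if current:                    # flush the final group
    groups.append(current)

# mean and standard deviation per group (pstdev so 1-element groups don't fail)
for g in groups:
    print(g, statistics.mean(g), statistics.pstdev(g))
```

Consecutive Nones produce no empty groups, so the full column would yield one list per country, ready for the mean/standard deviation step.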