How to match 2 different arrays with the same length - python

Good Morning,
I have 2 arrays of the same length (21 rows each), arranged in blocks of 3 rows. I would like to find matches between df and df1, then skip to the next 3 rows and find matches there, and so on. Then I would like to return the IDs and the matching values, and save the result to an Excel sheet.
Here's an example below:
df = np.array([[[100, 1, 2, 4, 5, 6, 8],
                [87, 1, 6, 20, 22, 23, 34],
                [99, 1, 12, 13, 34, 45, 46]],
               [[64, 1, 10, 14, 29, 32, 33],
                [55, 1, 22, 13, 23, 33, 35],
                [66, 1, 6, 7, 8, 9, 10]],
               [[77, 1, 2, 3, 5, 6, 8],
                [811, 1, 2, 5, 6, 8, 10],
                [118, 1, 7, 8, 22, 44, 56]]])
df1 = np.array([[[89, 2, 4, 6, 8, 10, 12],
                 [81, 2, 7, 25, 28, 33, 54],
                 [91, 2, 13, 17, 33, 45, 48]],
                [[66, 2, 20, 24, 25, 36, 38],
                 [56, 1, 24, 33, 43, 53, 65],
                 [78, 3, 14, 15, 18, 19, 20]],
                [[120, 4, 10, 23, 25, 26, 38],
                 [59, 5, 22, 35, 36, 38, 40],
                 [125, 1, 17, 18, 32, 44, 56]]])
I would like to find matches as follows:
100 to 89,81,91
87 to 89,81,91
99 to 89,81,91
64 to 66,56,78
55 to 66,56,78
66 to 66,56,78
77 to 120,59,125
811 to 120,59,125
118 to 120,59,125
Then output/save the result to an Excel sheet.
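One way to sketch this in code: compare each row of a block in df against every row of the corresponding block in df1 with np.intersect1d, treating the first entry of each row as the ID as in the example above. The output file name matches.xlsx is an assumption.

```python
import numpy as np
import pandas as pd

df = np.array([[[100, 1, 2, 4, 5, 6, 8],
                [87, 1, 6, 20, 22, 23, 34],
                [99, 1, 12, 13, 34, 45, 46]],
               [[64, 1, 10, 14, 29, 32, 33],
                [55, 1, 22, 13, 23, 33, 35],
                [66, 1, 6, 7, 8, 9, 10]],
               [[77, 1, 2, 3, 5, 6, 8],
                [811, 1, 2, 5, 6, 8, 10],
                [118, 1, 7, 8, 22, 44, 56]]])
df1 = np.array([[[89, 2, 4, 6, 8, 10, 12],
                 [81, 2, 7, 25, 28, 33, 54],
                 [91, 2, 13, 17, 33, 45, 48]],
                [[66, 2, 20, 24, 25, 36, 38],
                 [56, 1, 24, 33, 43, 53, 65],
                 [78, 3, 14, 15, 18, 19, 20]],
                [[120, 4, 10, 23, 25, 26, 38],
                 [59, 5, 22, 35, 36, 38, 40],
                 [125, 1, 17, 18, 32, 44, 56]]])

rows = []
for block, block1 in zip(df, df1):          # corresponding 3-row blocks
    for row in block:
        for row1 in block1:
            # values the two rows share, with the leading IDs excluded
            common = np.intersect1d(row[1:], row1[1:])
            if common.size:
                rows.append({'id': row[0], 'id1': row1[0],
                             'matches': ', '.join(map(str, common))})

out = pd.DataFrame(rows)
print(out)
# out.to_excel('matches.xlsx', index=False)   # needs openpyxl installed
```

For example, ID 100 against ID 89 yields the shared values 2, 4, 6, 8; pairs with no values in common are skipped.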

Related

Get the column name for rows containing a value

I have a .csv that looks like the below. I was wondering what the best way would be to keep the first few cols (id, account_id, date, amount, payments) intact while creating a new column containing the column name for observations with an 'X' marked.
The first 10 rows of the csv look like:
id,account_id,date,amount,payments,24_A,12_B,12_A,60_D,48_C,36_D,36_C,12_C,48_A,24_C,60_C,24_B,48_D,24_D,48_B,36_A,36_B,60_B,12_D,60_A
4959,2,1994-01-05,80952,3373,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4961,19,1996-04-29,30276,2523,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4962,25,1997-12-08,30276,2523,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4967,37,1998-10-14,318480,5308,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4968,38,1998-04-19,110736,2307,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4973,67,1996-05-02,165960,6915,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4986,97,1997-08-10,102876,8573,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4988,103,1997-12-06,265320,7370,-,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4989,105,1998-12-05,352704,7348,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4990,110,1997-09-08,162576,4516,-,-,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-
There used to be a DataFrame.lookup method, but it has been deprecated in favor of melt + loc[].
The idea is to use the id_vars as the grouping, and all the other columns get smashed into a single column with their respective value. Then filter where that value is X, effectively dropping the other rows.
import pandas as pd
df = pd.read_csv('test.txt')
df = df.melt(id_vars=['id','account_id','date','amount','payments'], var_name='x_col')
df = df.loc[df['value']=='X'].drop(columns='value')
print(df)
Output
      id  account_id        date  amount  payments x_col
0   4959           2  1994-01-05   80952      3373  24_A
5   4973          67  1996-05-02  165960      6915  24_A
11  4961          19  1996-04-29   30276      2523  12_B
22  4962          25  1997-12-08   30276      2523  12_A
26  4986          97  1997-08-10  102876      8573  12_A
33  4967          37  1998-10-14  318480      5308  60_D
44  4968          38  1998-04-19  110736      2307  48_C
48  4989         105  1998-12-05  352704      7348  48_C
57  4988         103  1997-12-06  265320      7370  36_D
69  4990         110  1997-09-08  162576      4516  36_C

Extract information from an Excel (by updating arrays) with Excel / Python

I have an Excel file with thousands of columns on the following format:
Member No.     X     Y     Z
1000          25    60   -30
             -69    38    68
              45     2    43
1001          24    55    79
               4    -7    89
              78    51    -2
1002          45   -55   149
              94    77  -985
              -2   559    56
I need a way such that I shall get a new table with the absolute maximum value from each column. In this example, something like:
Member No.     X     Y     Z
1000          69    60    68
1001          78    55    89
1002          94   559   985
I have tried it in Excel (using VLOOKUP to find the "Member Number" in the first row, then HLOOKUP to find the values in the rows thereafter), but the problem is that the HLOOKUP range is not automatically updated to the new array (the one containing member 1001). So my solution works for member 1000 but not for 1001 and 1002, because it always searches only in the first block (the rows for member 1000).
I also tried reading the file with Python, but I am not well-versed enough to make much of a headway - once the dataset has been read, how do I tell excel to read the next 3 rows and get the (absolute) maximum in each column?
Can someone please help? Solution required in Python 3 or Excel (ideally, Excel 2014).
The below solution will get you your desired output using Python.
I first ffill to fill in the blanks in your Member No. column. Then convert your dataframe values to positive using abs. Lastly, group by Member No. and take the max of each column.
Assuming your dataframe is called data:
import pandas as pd
data['Member No.'] = data['Member No.'].ffill().astype(int)
data = data.abs()
res = data.groupby('Member No.').max().reset_index()
Which will print you:
   Member No.   X    Y    Z    A    B    C
0        1000  69   60   68   60   74   69
1        1001  78   55   89   78   92   87
2        1002  94  559  985  985  971  976
Note that I added extra columns in your sample data to make sure that all the columns will return their max() value.
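For completeness, here is the whole pipeline as a runnable sketch. The sample DataFrame below stands in for what pd.read_excel('yourfile.xlsx') would return (the file name is an assumption); blank Member No. cells arrive as NaN.

```python
import pandas as pd
import numpy as np

# Stand-in for data = pd.read_excel('yourfile.xlsx'): continuation rows
# have a blank (NaN) Member No., exactly as in the question's layout.
data = pd.DataFrame({
    'Member No.': [1000, np.nan, np.nan, 1001, np.nan, np.nan, 1002, np.nan, np.nan],
    'X': [25, -69, 45, 24, 4, 78, 45, 94, -2],
    'Y': [60, 38, 2, 55, -7, 51, -55, 77, 559],
    'Z': [-30, 68, 43, 79, 89, -2, 149, -985, 56],
})

# Fill the blanks downward so every row carries its member number,
# take absolute values, then the max per member and column.
data['Member No.'] = data['Member No.'].ffill().astype(int)
res = data.set_index('Member No.').abs().groupby('Member No.').max().reset_index()
print(res)
```

The result matches the desired table: member 1000 gives 69, 60, 68, and member 1002 gives 94, 559, 985. `res.to_excel(...)` would write it back out if needed.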

question on importing multiple csv files in python system

I have a question on importing multiple csv files such that they are vertically stacked into a column array.
Here is a sample; all the files look the same:
yyyymm count_neg count_pos count_all score
200301 114 67 7470 0.006291834
200303 106 51 3643 0.015097447
200305 102 62 3925 0.010191083
200306 129 71 4964 0.011684126
200308 53 50 3819 0.000785546
200309 59 58 3926 0.000254712
200310 50 63 3734 -0.003481521
200312 75 55 4256 0.004699248
This particular set of data is from a file called 2003.csv.
I also have similar file names for years 2004, 2005, 2006
So I am wondering how to get these into Python such that the CSVs are stacked vertically into one array.
Right now, all I know how to do is this:
yr2003 = pandas.read_csv('2003.csv', header=0,parse_dates=True)
While df = pd.concat([yr2003, yr2004, yr2005]) does indeed combine things, I am only looking to combine the yyyymm, count_all and score columns.
This should work
df = pd.concat([yr2003, yr2004, yr2005])
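To keep only the columns mentioned, pass usecols to read_csv before concatenating. A sketch, using in-memory stand-ins for 2003.csv and 2004.csv (in practice, pass the file names to read_csv instead of StringIO):

```python
import pandas as pd
from io import StringIO

# Hypothetical stand-ins for the yearly files; in practice use
# pd.read_csv('2003.csv', usecols=cols) etc.
csv_2003 = """yyyymm,count_neg,count_pos,count_all,score
200301,114,67,7470,0.006291834
200303,106,51,3643,0.015097447
"""
csv_2004 = """yyyymm,count_neg,count_pos,count_all,score
200401,99,70,7100,0.004
"""

cols = ['yyyymm', 'count_all', 'score']
frames = [pd.read_csv(StringIO(s), usecols=cols) for s in (csv_2003, csv_2004)]
df = pd.concat(frames, ignore_index=True)   # stack vertically, renumber rows
print(df)
```

ignore_index=True renumbers the rows so the index does not repeat 0, 1, 2... for each year. Drop 'score' from cols if only yyyymm and count_all are wanted.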

read a text file which has key value pairs and convert each line as one dictionary using python pandas

I have a text file (one.txt) that contains an arbitrary number of key-value pairs per line, where the key and value are separated by = (e.g. 1=8). Here are some examples:
1=88|11=1438|15=KKK|45=00|45=00|21=66|86=a
4=13|11=1438|49=DDD|8=157.73|67=00|45=00|84=b|86=a
6=84|41=18|56=TTT|67=00|4=13|45=00|07=d
I need to create a DataFrame with a list of dictionaries, with each row as one dictionary in the list like so:
[{1:88,11:1438,15:kkk,45:7.7....},{4:13,11:1438....},{6:84,41:18,56:TTT...}]
import pandas as pd
df = pd.read_csv("input.txt", names=['text'], header=None)
data = df['text'].str.split("|")
names = [y.split('=') for x in data for y in x]
ds = pd.DataFrame(names)
print(ds)
How can I create a dictionary for each line by splitting on the = symbol?
Each line should become one row with multiple columns; the DataFrame should have all keys as columns and the values as rows.
Example:
1 11 15 45 21 86 4 49 8 67 84 6 41 56 45 07
88 1438 kkk 00 66 a
na 1438 na .....
I think performing a .pivot would work. Try this:
import pandas as pd
df = pd.read_csv("input.txt",names=['text'],header=None)
data = df['text'].str.split("|")
names=[ y.split('=') for x in data for y in x]
ds=pd.DataFrame(names)
ds = ds.pivot(columns=0).fillna('')
The .fillna('') replaces the NaN values with empty strings. If you'd like na instead, use .fillna('na').
Output:
ds.head()
       1
0     07     1    11   15 21  4 41  45 49 56  6 67  8 84 86
0           88
1                1438
2                      KKK
3                                   00
4                                   00
For space I didn't print the entire dataframe, but it does column indexing based on the key and then values based on the values for each line (preserving the dict by line concept).
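If you literally want one dictionary per line, as described in the question, you can build the records directly and let pandas line up the keys as columns. A sketch, with the sample lines inlined in place of reading one.txt:

```python
import pandas as pd

# The three sample lines from the question; in practice, read them
# from one.txt with open('one.txt').read().
text = """1=88|11=1438|15=KKK|45=00|45=00|21=66|86=a
4=13|11=1438|49=DDD|8=157.73|67=00|45=00|84=b|86=a
6=84|41=18|56=TTT|67=00|4=13|45=00|07=d"""

# One dict per line: split on '|', then each pair on '='. A duplicate
# key within a line (e.g. 45=00 twice) keeps its last value.
records = [dict(pair.split('=') for pair in line.split('|'))
           for line in text.splitlines()]

df = pd.DataFrame(records)   # keys become columns, one row per line
print(df)
```

Missing keys come out as NaN in the other rows, which matches the "na" placeholders in the desired output.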

How to separate extracted column from excel into different lists or groups in Python

I am reading a column from an excel file using openpyxl.
I have written code to pull the column of data I need into Python, but the data is separated by empty cells.
I want to group these data wherever the cell value is not None into 19 sets of countries so that I can use it later to calculate the mean and standard deviation for the 19 countries.
I don't want to hard code it using list slices. Instead I want to save these integers to a list (or list of lists) using a loop, but I'm not sure how to, because this is my first project with Python.
Here's my code:
# Read PCT rankings project ratified results
# Beta
import openpyxl
wb = openpyxl.load_workbook('PCT rankings project ratified results.xlsx', data_only=True)
sheet = wb.get_sheet_by_name('PCT by IP firms')
row_counter = sheet.max_row
column_counter = sheet.max_column
print(row_counter)
print(column_counter)

# Iterating over the column of patent filings, trying to use empty cells to
# flag the loop to append/store the list of numbers before reaching the next
# non-empty cell, and repeat every time this happens (expecting 19 times)
list = []
for row in range(4, sheet.max_row + 1):
    patent = sheet['I' + str(row)].value
    print(patent)
    if patent == None:
        list.append(patent)
print(list)
This is the output from Python giving you a visualisation of what I am trying to do.
Column I:
412
14
493
488
339
273
238
226
200
194
153
164
151
126
None
120
None
None
133
77
62
79
24
0
30
20
16
0
6
9
11
None
None
None
None
608
529
435
320
266
264
200
272
134
113
73
23
12
52
21
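The grouping the loop above is reaching for can be done with itertools.groupby, splitting the column on the None entries. A sketch, using a shortened stand-in for the full column (in practice, build values from the cells: values = [sheet['I' + str(r)].value for r in range(4, sheet.max_row + 1)]):

```python
from itertools import groupby
from statistics import mean

# Shortened stand-in for the column I values read with openpyxl.
values = [412, 14, 493, None, 120, None, None, 133, 77, 62]

# groupby clusters consecutive items with the same key; keying on
# "is not None" yields alternating runs of numbers and Nones, and we
# keep only the number runs.
groups = [list(g) for is_num, g in groupby(values, key=lambda v: v is not None)
          if is_num]
print(groups)   # → [[412, 14, 493], [120], [133, 77, 62]]

# Each group is then one country's figures, ready for statistics:
means = [mean(g) for g in groups]
```

With the real column this should yield the 19 country groups; statistics.mean and statistics.stdev (or numpy) then give the per-country figures.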
