Question on importing multiple CSV files in Python
I have a question about importing multiple CSV files so that they are vertically stacked into a single array.
Here is a sample; all the files look the same:
yyyymm count_neg count_pos count_all score
200301 114 67 7470 0.006291834
200303 106 51 3643 0.015097447
200305 102 62 3925 0.010191083
200306 129 71 4964 0.011684126
200308 53 50 3819 0.000785546
200309 59 58 3926 0.000254712
200310 50 63 3734 -0.003481521
200312 75 55 4256 0.004699248
This particular set of data is from a CSV file called 2003.csv.
I also have similar files for the years 2004, 2005, and 2006.
So, again, I am wondering how to get these into Python so that the CSV files are vertically stacked into a single array.
Right now, all I know how to do is this:
yr2003 = pandas.read_csv('2003.csv', header=0, parse_dates=True)
While df = pd.concat([yr2003, yr2004, yr2005]) does indeed combine things, I am only looking to combine the yyyymm, count_all, and score columns.
This should work:
df = pd.concat([yr2003, yr2004, yr2005])
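Since the asker only wants the yyyymm, count_all, and score columns, a minimal sketch is to pass usecols to read_csv and concatenate in a loop. This assumes the files are named 2003.csv through 2006.csv and sit in the working directory:

import pandas as pd

frames = []
for year in range(2003, 2007):
    # usecols keeps only the columns we care about
    frames.append(pd.read_csv(f'{year}.csv', usecols=['yyyymm', 'count_all', 'score']))

# ignore_index=True renumbers the stacked rows 0..n-1;
# drop it if you want to keep each file's original index
df = pd.concat(frames, ignore_index=True)
print(df.head())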
Related
How to match 2 different arrays with the same length
Good morning. I have 2 arrays with the same length, over 21 rows. I would like to find a match between df and df1, then skip to the next 3 rows to find a match, etc. Then I would like to return the IDs and matching values, and save them to an Excel sheet. Here's an example below:

df = np.array([[[100,1,2,4,5,6,8], [87,1,6,20,22,23,34], [99,1,12,13,34,45,46]],
               [[64,1,10,14,29,32,33], [55,1,22,13,23,33,35], [66,1,6,7,8,9,10]],
               [[77,1,2,3,5,6,8], [811,1,2,5,6,8,10], [118,1,7,8,22,44,56]]])

df1 = np.array([[[89,2,4,6,8,10,12], [81,2,7,25,28,33,54], [91,2,13,17,33,45,48]],
                [[66,2,20,24,25,36,38], [56,1,24,33,43,53,65], [78,3,14,15,18,19,20]],
                [[120,4,10,23,25,26,38], [59,5,22,35,36,38,40], [125,1,17,18,32,44,56]]])

I would like to find a match in:

100 to 89,81,91
87 to 89,81,91
99 to 89,81,91
64 to 66,56,78
55 to 66,56,78
66 to 66,56,78
77 to 120,59,125
811 to 120,59,125
118 to 120,59,125

Output/save to an Excel sheet.
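No answer appears in this digest, but one reading of the request is: for each 3-row block, compare every row of df against every row of the corresponding block in df1 and report the shared values. A sketch under those assumptions; the block pairing, treating the first element as an ID, and the output name matches.xlsx are all guesses:

import numpy as np
import pandas as pd

df = np.array([[[100,1,2,4,5,6,8], [87,1,6,20,22,23,34], [99,1,12,13,34,45,46]],
               [[64,1,10,14,29,32,33], [55,1,22,13,23,33,35], [66,1,6,7,8,9,10]],
               [[77,1,2,3,5,6,8], [811,1,2,5,6,8,10], [118,1,7,8,22,44,56]]])
df1 = np.array([[[89,2,4,6,8,10,12], [81,2,7,25,28,33,54], [91,2,13,17,33,45,48]],
                [[66,2,20,24,25,36,38], [56,1,24,33,43,53,65], [78,3,14,15,18,19,20]],
                [[120,4,10,23,25,26,38], [59,5,22,35,36,38,40], [125,1,17,18,32,44,56]]])

rows = []
for block, block1 in zip(df, df1):              # pair up the 3-row blocks
    for row in block:
        for row1 in block1:
            # treat the first element as an ID, the rest as values to compare
            shared = sorted(set(row[1:]) & set(row1[1:]))
            if shared:
                rows.append({'id': int(row[0]),
                             'id1': int(row1[0]),
                             'matches': ', '.join(str(v) for v in shared)})

out = pd.DataFrame(rows)
print(out)
out.to_excel('matches.xlsx', index=False)       # needs openpyxl installed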
Get the column name for rows containing a value
I have a .csv that looks like the below. I was wondering what the best way would be to keep the first few columns (id, account_id, date, amount, payments) intact while creating a new column containing the column name for observations marked with an 'X'. The first 10 rows of the csv look like:

id,account_id,date,amount,payments,24_A,12_B,12_A,60_D,48_C,36_D,36_C,12_C,48_A,24_C,60_C,24_B,48_D,24_D,48_B,36_A,36_B,60_B,12_D,60_A
4959,2,1994-01-05,80952,3373,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4961,19,1996-04-29,30276,2523,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4962,25,1997-12-08,30276,2523,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4967,37,1998-10-14,318480,5308,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4968,38,1998-04-19,110736,2307,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4973,67,1996-05-02,165960,6915,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4986,97,1997-08-10,102876,8573,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4988,103,1997-12-06,265320,7370,-,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4989,105,1998-12-05,352704,7348,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4990,110,1997-09-08,162576,4516,-,-,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-
There used to be something called lookup, but that's been deprecated in favor of melt + loc[]. The idea is to use the id_vars as the grouping, and all the other columns get smashed into a single column with their respective value. Then filter where that value is X, effectively dropping the other rows.

import pandas as pd

df = pd.read_csv('test.txt')
df = df.melt(id_vars=['id','account_id','date','amount','payments'], var_name='x_col')
df = df.loc[df['value']=='X'].drop(columns='value')
print(df)

Output:

      id  account_id        date  amount  payments x_col
0   4959           2  1994-01-05   80952      3373  24_A
5   4973          67  1996-05-02  165960      6915  24_A
11  4961          19  1996-04-29   30276      2523  12_B
22  4962          25  1997-12-08   30276      2523  12_A
26  4986          97  1997-08-10  102876      8573  12_A
33  4967          37  1998-10-14  318480      5308  60_D
44  4968          38  1998-04-19  110736      2307  48_C
48  4989         105  1998-12-05  352704      7348  48_C
57  4988         103  1997-12-06  265320      7370  36_D
69  4990         110  1997-09-08  162576      4516  36_C
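Not from the thread, but worth noting: if every row has exactly one 'X', an alternative sketch looks the matching column up directly with eq + idxmax (same file and column names as above):

import pandas as pd

df = pd.read_csv('test.txt')
id_vars = ['id', 'account_id', 'date', 'amount', 'payments']

# idxmax on a boolean frame returns the first column that is True in each row,
# so this assumes every row has exactly one 'X'
df['x_col'] = df.drop(columns=id_vars).eq('X').idxmax(axis=1)
print(df[id_vars + ['x_col']])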
How do I sort columns of numerical file data in Python
I'm trying to write a piece of code in Python to graph some data from a tab-separated file with numerical data. I'm very new to Python, so I would appreciate it if any help could be dumbed down a little bit. Basically, I have this file and I would like to take two columns from it, sort them each in ascending order, and then graph those sorted columns against each other.
First of all, you should not post code as images, since there is functionality to insert and format it here in the editor.

It's as simple as calling x.sort() and y.sort(), since both of them are slices from data, so that should work fine (assuming they are 1-dimensional arrays). Here is an example:

import numpy as np

array = np.random.randint(0, 100, size=20)
print(array)

Output:

[89 47  4 10 29 21 91 95 32 12 97 66 59 70 20 20 36 79 23  4]

So we use the method mentioned before. Note that sort() sorts in place and returns None, so sort first and then print:

array.sort()
print(array)

Output:

[ 4  4 10 12 20 20 21 23 29 32 36 47 59 66 70 79 89 91 95 97]

Easy as that :)
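The answer stops at sorting, but the question also asks about graphing the sorted columns against each other. A hedged sketch, assuming a tab-separated file named data.txt and that the two columns of interest are the first two:

import numpy as np
import matplotlib.pyplot as plt

# adjust the file name and delimiter to match the real file
data = np.loadtxt('data.txt', delimiter='\t')

# np.sort returns a sorted copy, leaving the original columns untouched
x = np.sort(data[:, 0])
y = np.sort(data[:, 1])

plt.plot(x, y, marker='o')
plt.xlabel('column 1 (sorted)')
plt.ylabel('column 2 (sorted)')
plt.show()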
Extract information from an Excel file (by updating arrays) with Excel / Python
I have an Excel file with thousands of columns in the following format:

Member No.    X     Y     Z
1000         25    60   -30
            -69    38    68
             45     2    43
1001         24    55    79
              4    -7    89
             78    51    -2
1002         45   -55   149
             94    77  -985
             -2   559    56

I need a way to get a new table with the absolute maximum value from each column. In this example, something like:

Member No.    X     Y    Z
1000         69    60   68
1001         78    55   89
1002         94   559  985

I have tried it in Excel (using VLOOKUP to find the "Member Number" in the first row and then HLOOKUP to find the values in the rows thereafter), but the problem is that the HLOOKUP command is not automatically updated with the new array (the array in which member number 1001 is), so my solution works for member 1000 but not for 1001 and 1002; it always searches for the new value ONLY in the 1st row (i.e. the row with member number 1000).

I also tried reading the file with Python, but I am not well-versed enough to make much headway: once the dataset has been read, how do I take the next 3 rows and get the (absolute) maximum in each column? Can someone please help? Solution required in Python 3 or Excel (ideally Excel 2014).
The below solution will get you your desired output using Python. I first ffill to fill in the blanks in your Member No. column (axis=0 means row-wise). Then I convert the dataframe values to positive using abs. Lastly, grouping by Member No., I take the max value of every column. Assuming your dataframe is called data:

import pandas as pd

data['Member No.'] = data['Member No.'].ffill(axis=0).astype(int)
data = data.abs()
res = data.groupby('Member No.').max().reset_index()
print(res)

Which will print you:

   Member No.   X    Y    Z    A    B    C
0        1000  69   60   68   60   74   69
1        1001  78   55   89   78   92   87
2        1002  94  559  985  985  971  976

Note that I added extra columns (A, B, C) to your sample data to make sure that all the columns return their max() value.
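The answer assumes the dataframe data already exists. For completeness, a sketch of loading it from the workbook first; the file name input.xlsx and the first-sheet choice are assumptions, and abs() presumes all columns are numeric:

import pandas as pd

# reading .xlsx files requires the openpyxl package
data = pd.read_excel('input.xlsx', sheet_name=0)

data['Member No.'] = data['Member No.'].ffill().astype(int)
res = data.abs().groupby('Member No.').max().reset_index()
print(res)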
Analysing a JSON file in Python using pandas
I have to analyse a lot of data for my Bachelor's project. The data will be handed to me in .json files. My supervisor has told me that it should be fairly easy if I just use pandas. Since I am all new to Python (I have decent experience with MATLAB and C, though), I am having a rough start. If someone would be so kind as to explain how to do this, I would really appreciate it. The files look like this:

{"columns":["id","timestamp","offset_freq","reprate_freq"],
"index":[0,1,2,3,4,5,6,7 ...
"data":[[526144,1451900097533,20000000.495000001,250000093.9642499983],[...

I need to import the data and analyse it (make some plots), but I'm not sure how to import data like this.

Ps. I have Python and the required packages installed.
You did not give the full format of the JSON file, but if it looks like this (I made up random numbers):

{"columns":["id","timestamp","offset_freq","reprate_freq"],
"index":[0,1,2,3,4,5,6,7,8,9],
"data":[[39,69,50,51],[62,14,12,49],[17,99,65,79],[93,5,29,0],[89,37,42,47],[83,79,26,29],[88,17,2,7],[95,87,34,34],[40,54,18,68],[84,56,94,40]]}

then you can do:

df = pd.read_json(file_name_or_Python_string, orient='split')
print(df)

   id  timestamp  offset_freq  reprate_freq
0  39         69           50            51
1  62         14           12            49
2  17         99           65            79
3  93          5           29             0
4  89         37           42            47
5  83         79           26            29
6  88         17            2             7
7  95         87           34            34
8  40         54           18            68
9  84         56           94            40
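Since the asker also wants to make some plots, a short hedged follow-up; the file name data.json and the choice of columns are assumptions:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_json('data.json', orient='split')

# plot one measured quantity against the timestamp column
df.plot(x='timestamp', y='offset_freq', marker='.')
plt.show()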