Get the particular rows in Python
I have two csv files.
One is as follows:
"CONS_NO","DATA_DATE","KWH_READING","KWH_READING1","KWH"
"1652714033","2015/1/12","4747.3800","4736.8000","10.5800"
"3332440062","2015/1/12","408.6800","407.8200","0.8600"
"7804314033","2015/1/12","1794.3500","1792.5000","1.8500"
"0114314033","2015/1/12","3525.2000","3519.4400","5.7600"
"1742440062","2015/1/12","3097.1900","3091.4100","5.7800"
"8230100023","2015/1/12","1035.0500","1026.8400","8.2100"
About six million rows in all.
The other is as follows:
6360609057
8771218657
1338004100
2500009393
9184968250
9710581700
8833903141
About 10 thousand rows in all.
The second csv file contains only CONS_NO values. In Python, I want to keep the rows in the first csv file whose CONS_NO appears in the second csv file, and delete all the other rows from the first csv file.
You can merge the two DataFrames using pandas' merge function.
I changed your example data to the following:
test1.csv is:
"CONS_NO","DATA_DATE","KWH_READING","KWH_READING1","KWH"
"1652714033","2015/1/12","4747.3800","4736.8000","10.5800"
"3332440062","2015/1/12","408.6800","407.8200","0.8600"
"7804314033","2015/1/12","1794.3500","1792.5000","1.8500"
"8833903141","2015/1/12","3525.2000","3519.4400","5.7600"
"1742440062","2015/1/12","3097.1900","3091.4100","5.7800"
"8833903141","2015/1/12","1035.0500","1026.8400","8.2100"
test2.csv is:
6360609057
8771218657
1338004100
2500009393
9184968250
9710581700
8833903141
You can now merge them using the following code:
import pandas as pd
df1 = pd.read_csv('test1.csv')
df2 = pd.read_csv('test2.csv', names=['CONS_NO'])
pd.merge(df1, df2, on='CONS_NO')
It gives the following output:
CONS_NO DATA_DATE KWH_READING KWH_READING1 KWH
0 8833903141 2015/1/12 3525.20 3519.44 5.76
1 8833903141 2015/1/12 1035.05 1026.84 8.21
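One thing to watch out for, offered as a hedged addition: with its default settings, read_csv parses the quoted CONS_NO fields as integers, so an ID with a leading zero such as "0114314033" in the original data would lose that zero and fail to match. A minimal sketch that reads both columns as strings and keeps only the matching rows (the output file name is just an example):

import pandas as pd

# read CONS_NO as strings so leading zeros survive
df1 = pd.read_csv('test1.csv', dtype={'CONS_NO': str})
df2 = pd.read_csv('test2.csv', names=['CONS_NO'], dtype=str)

# keep the rows of the first file whose CONS_NO appears in the second,
# then write the filtered result back out
df1[df1['CONS_NO'].isin(df2['CONS_NO'])].to_csv('test1_filtered.csv', index=False)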
Related
Can I loop the same analysis across multiple csv dataframes then concatenate results from each into one table?
Newbie Python learner here! I have 20 participant csv files (P01.csv to P20.csv) with dataframes in them that contain Stroop test data. The important columns for each are the condition column, which has a random mix of incongruent and congruent conditions, the reaction time column for each condition, and the column for whether the response was correct, true or false. Here is an example of the dataframe for P01 (I'm not sure if this counts as a code snippet?):

trialnum,colourtext,colourname,condition,response,rt,correct
1,blue,red,incongruent,red,0.767041,True
2,yellow,yellow,congruent,yellow,0.647259,True
3,green,blue,incongruent,blue,0.990185,True
4,green,green,congruent,green,0.720116,True
5,yellow,yellow,congruent,yellow,0.562909,True
6,yellow,yellow,congruent,yellow,0.538918,True
7,green,yellow,incongruent,yellow,0.693017,True
8,yellow,red,incongruent,red,0.679368,True
9,yellow,blue,incongruent,blue,0.951432,True
10,blue,blue,congruent,blue,0.633367,True
11,blue,green,incongruent,green,1.289047,True
12,green,green,congruent,green,0.668142,True
13,blue,red,incongruent,red,0.647722,True
14,red,blue,incongruent,blue,0.858307,True
15,red,red,congruent,red,1.820112,True
16,blue,green,incongruent,green,1.118404,True
17,red,red,congruent,red,0.798532,True
18,red,red,congruent,red,0.470939,True
19,red,blue,incongruent,blue,1.142712,True
20,red,red,congruent,red,0.656328,True
21,red,yellow,incongruent,yellow,0.978830,True
22,green,red,incongruent,red,1.316182,True
23,yellow,yellow,congruent,green,0.964292,False
24,green,green,congruent,green,0.683949,True
25,yellow,green,incongruent,green,0.583939,True
26,green,blue,incongruent,blue,1.474140,True
27,green,blue,incongruent,blue,0.569109,True
28,green,green,congruent,blue,1.196470,False
29,red,red,congruent,red,4.027546,True
30,blue,blue,congruent,blue,0.833177,True
31,red,red,congruent,red,1.019672,True
32,green,blue,incongruent,blue,0.879507,True
33,red,red,congruent,red,0.579254,True
34,red,blue,incongruent,blue,1.070518,True
35,blue,yellow,incongruent,yellow,0.723852,True
36,yellow,green,incongruent,green,0.978838,True
37,blue,blue,congruent,blue,1.038232,True
38,yellow,green,incongruent,yellow,1.366425,False
39,green,red,incongruent,red,1.066038,True
40,blue,red,incongruent,red,0.693698,True
41,red,blue,incongruent,blue,1.751062,True
42,blue,blue,congruent,blue,0.449651,True
43,green,red,incongruent,red,1.082267,True
44,blue,blue,congruent,blue,0.551023,True
45,red,blue,incongruent,blue,1.012258,True
46,yellow,green,incongruent,yellow,0.801443,False
47,blue,blue,congruent,blue,0.664119,True
48,red,green,incongruent,yellow,0.716189,False
49,green,green,congruent,yellow,0.630552,False
50,green,yellow,incongruent,yellow,0.721917,True
51,red,red,congruent,red,1.153943,True
52,blue,red,incongruent,red,0.571019,True
53,yellow,yellow,congruent,yellow,0.651611,True
54,blue,blue,congruent,blue,1.321344,True
55,green,green,congruent,green,1.159240,True
56,blue,blue,congruent,blue,0.861646,True
57,yellow,red,incongruent,red,0.793069,True
58,yellow,yellow,congruent,yellow,0.673190,True
59,yellow,red,incongruent,red,1.049320,True
60,red,yellow,incongruent,yellow,0.773447,True
61,red,yellow,incongruent,yellow,0.693554,True
62,red,red,congruent,red,0.933901,True
63,blue,blue,congruent,blue,0.726794,True
64,green,green,congruent,green,1.046116,True
65,blue,blue,congruent,blue,0.713565,True
66,blue,blue,congruent,blue,0.494177,True
67,green,green,congruent,green,0.626399,True
68,blue,blue,congruent,blue,0.711896,True
69,blue,blue,congruent,blue,0.460420,True
70,green,green,congruent,yellow,1.711978,False
71,blue,blue,congruent,blue,0.634218,True
72,yellow,blue,incongruent,yellow,0.632482,False
73,yellow,yellow,congruent,yellow,0.653813,True
74,green,green,congruent,green,0.808987,True
75,blue,blue,congruent,blue,0.647117,True
76,green,red,incongruent,red,1.791693,True
77,red,yellow,incongruent,yellow,1.482570,True
78,red,red,congruent,red,0.693132,True
79,red,yellow,incongruent,yellow,0.815830,True
80,green,green,congruent,green,0.614441,True
81,yellow,red,incongruent,red,1.080385,True
82,red,green,incongruent,green,1.198548,True
83,blue,green,incongruent,green,0.845769,True
84,yellow,blue,incongruent,blue,1.007089,True
85,green,blue,incongruent,blue,0.488701,True
86,green,green,congruent,yellow,1.858272,False
87,yellow,yellow,congruent,yellow,0.893149,True
88,yellow,yellow,congruent,yellow,0.569597,True
89,yellow,yellow,congruent,yellow,0.483542,True
90,yellow,red,incongruent,red,1.669842,True
91,blue,green,incongruent,green,1.158416,True
92,blue,red,incongruent,red,1.853055,True
93,green,yellow,incongruent,yellow,1.023785,True
94,yellow,blue,incongruent,blue,0.955395,True
95,yellow,yellow,congruent,yellow,1.303260,True
96,blue,yellow,incongruent,yellow,0.737741,True
97,yellow,green,incongruent,green,0.730972,True
98,green,red,incongruent,red,1.564596,True
99,yellow,yellow,congruent,yellow,0.978911,True
100,blue,yellow,incongruent,yellow,0.508151,True
101,red,green,incongruent,green,1.821969,True
102,red,red,congruent,red,0.818726,True
103,yellow,yellow,congruent,yellow,1.268222,True
104,yellow,yellow,congruent,yellow,0.585495,True
105,green,green,congruent,green,0.673404,True
106,blue,yellow,incongruent,yellow,1.407036,True
107,red,red,congruent,red,0.701050,True
108,red,green,incongruent,red,0.402334,False
109,red,green,incongruent,green,1.537681,True
110,green,yellow,incongruent,yellow,0.675118,True
111,green,green,congruent,green,1.004550,True
112,yellow,blue,incongruent,blue,0.627439,True
113,yellow,yellow,congruent,yellow,1.150248,True
114,blue,yellow,incongruent,yellow,0.774452,True
115,red,red,congruent,red,0.860966,True
116,red,red,congruent,red,0.499595,True
117,green,green,congruent,green,1.059725,True
118,red,red,congruent,red,0.593180,True
119,green,yellow,incongruent,yellow,0.855915,True
120,blue,green,incongruent,green,1.335018,True

But I am only interested in the 'condition', 'rt', and 'correct' columns. I need to create a table giving the mean reaction time for the congruent conditions and the incongruent conditions, and the percentage correct for each condition, as one overall table of these results for every participant. I am aiming for something like this as an output table:

Participant  Stimulus Type  Mean Reaction Time  Percentage Correct
01           Congruent      0.560966            80
01           Incongruent    0.890556            64
02           Congruent      0.460576            89
02           Incongruent    0.956556            55

Etc. for all 20 participants. This is just an example of my ideal output, because later I'd like to plot a graph of the means for each condition across the participants. But if anyone thinks that table does not make sense or is inefficient, I'm open to any advice! I want to use pandas but don't know where to begin finding the rt means for each condition when there are two different conditions in the same column in each dataframe. And I'm assuming I need some kind of loop that can run over each participant csv file and then concatenate the results into a table for all the participants?
Initially, after struggling to figure out the loop I would need and looking on the web, I ran this code, which worked to concatenate all of the participants' dataframes. I hoped this would let me do the same analysis on all of them at once, but the problem is that it doesn't identify the individual participant for each of the rows coming from each participant csv file (there are 120 rows per participant, like the example above) once they are in one table:

import os
import glob
import pandas as pd

# set working directory
os.chdir('data')

# find all csv files in the folder
# use glob pattern matching -> extension = 'csv'
# save result in list -> all_filenames
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
#print(all_filenames)

# combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames])

# export to csv
combined_csv.to_csv("combined_csv.csv", index=False, encoding='utf-8-sig')

Perhaps I could add a participant column to identify each participant's data set in the concatenated table, and then perform the mean and percentage-correct analysis on the two conditions for each participant in that big concatenated table? Or would it be better to do the analysis first and then loop it over all of the individual participant csv files? I'm sorry if this is a really obvious process; I'm new to Python and trying to learn to analyse my data more efficiently, and I've been scouring the Internet and pandas tutorials but I'm stuck. Any help is welcome! I've also never used Stack Overflow before, so sorry if I haven't formatted things correctly here, but thanks for the feedback about including examples of the input data, the code I've tried, and the desired output data. I really appreciate the help.
Try this:

from pathlib import Path
import pandas as pd

# Use the Path class to represent a path. It offers more
# functionality when performing operations on paths
path = Path("./data").resolve()

# Create a dictionary whose keys are the participant IDs
# (the `01` in `P01.csv`, etc.) and whose values are the
# data frames initialized from the CSV files
data = {
    p.stem[1:]: pd.read_csv(p) for p in path.glob("*.csv")
}

# Create a master data frame by combining the individual
# data frames from each CSV file; the keys become a
# "participant" index level
df = pd.concat(data, keys=data.keys(), names=["participant", None])

# Calculate the statistics
result = (
    df.groupby(["participant", "condition"])
      .agg(**{
          "Mean Reaction Time": ("rt", "mean"),
          "correct": ("correct", "sum"),
          "size": ("trialnum", "size"),
      })
      .assign(**{
          # multiply by 100 so it is a percentage, matching the desired output
          "Percentage Correct": lambda x: x["correct"] / x["size"] * 100
      })
      .drop(columns=["correct", "size"])
      .reset_index()
)
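Since the stated goal is eventually to plot the per-condition means across participants, here is a possible follow-up sketch; it assumes matplotlib is installed and the file names are just examples:

import matplotlib.pyplot as plt

# one bar group per participant, one bar per condition
(result.pivot(index="participant", columns="condition",
              values="Mean Reaction Time")
       .plot(kind="bar"))
plt.ylabel("Mean Reaction Time (s)")
plt.tight_layout()
plt.show()

result.to_csv("stroop_summary.csv", index=False)  # keep a copy of the summary table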
How to check if value of dataframe one exist in dataframe two and join two dataframes?
I have two csv files like the ones below.

city.csv:

City,Province
aa,b
bb,c
ee,b

customers.csv:

Address, CustomerID
John Smith aa blab blab, 234
Micheal Smith bb blab2 blab2, 123

I want to join the two csv files with a pandas dataframe, with the condition (if City in Address). I tried the code below:

import pandas as pd

df1 = pd.read_csv(r"city.csv")
df2 = pd.read_csv(r"customers.csv")
df1["City"] = df2.drop("Address", 1).isin(df2["Address"]).any(1)

I followed this Q/A but it did not work for me. How can I join these two csv files in a pandas dataframe?
Use:

pat = '|'.join(df1["City"].values)
df2['col to join'] = df2['Address'].str.extract(f'({pat})')
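This extracts the first known city name found inside each address into a new column. To complete the join, a sketch under the assumption that every address contains exactly one known city (the column names come from the question):

# attach the Province by matching the extracted city name
result = df2.merge(df1, left_on='col to join', right_on='City', how='left')

One caveat: str.extract treats the pattern as a regular expression, so city names containing regex metacharacters would need re.escape applied before joining them with '|'.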
Writing pandas column to csv without merging integers
I have extracted user_id against shop_ids as a pandas dataframe from a database using an SQL query.

     user_id    shop_ids
0  022221205    541
1  023093087    5088,4460,4460,4460,4460,4460,4460,4460,5090
2  023096023    2053,2053,2053,2053,2053,2053,2053,2053,2053,1...
3  023096446    4339,4339,3966,4339,4339
4  023098684    5004,3604,5004,5749,5004

I am trying to write this dataframe into a csv using:

df.to_csv('users_ordered_shops.csv')

I end up with the csv merging the shop ids into one number, as such:

    user_id    shop_ids
0  22221205    541
1  23093087    508,844,604,460,446,000,000,000,000,000,000,000
2  23096023    2,053,205,320,532,050,000,000,000,000,000,000,000,000,000,000,000,000
3  23096446    43,394,339,396,643,300,000
4  23098684    50,043,604,500,457,400,000

The values for index 2 are:

print(df.iloc[2].shop_ids)
2053,2053,2053,2053,2053,2053,2053,2053,2053,1294,1294,2053,1922

The expected output is a csv file with all shop_ids intact, in one column or in different columns, like:

     user_id    shop_ids
0  022221205    541
1  023093087    5088,4460,4460,4460,4460,4460,4460,4460,5090
2  023096023    2053,2053,2053,2053,2053,2053,2053,2053,2053,1294,1294,2053,1922
3  023096446    4339,4339,3966,4339,4339
4  023098684    5004,3604,5004,5749,5004

Any tips on how to get the shop ids without merging when writing to a csv file? I have tried converting the shop_ids column using astype() to int and str, which resulted in the same output.
Update: to get one shop per column (and remove duplicates), you can use:

pd.concat([df['user_id'],
           df['shop_ids'].apply(lambda x: sorted(set(x.split(','))))
                         .apply(pd.Series)],
          axis=1).to_csv('users_ordered_shops.csv', index=False)

Change the delimiter. Try:

df.to_csv('users_ordered_shops.csv', sep=';')

Or change the quoting strategy:

import csv
df.to_csv('users_ordered_shops.csv', quoting=csv.QUOTE_NONNUMERIC)
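A side note, hedged because it depends on how the file is being inspected: the merged numbers with thousands-style grouping look like a spreadsheet program (e.g. Excel) re-parsing the commas inside the field when it opens the file, rather than pandas corrupting the data. A quick way to check what the file actually contains is to read it back with Python's csv module:

import csv

with open('users_ordered_shops.csv', newline='') as f:
    for row in csv.reader(f):
        print(row)  # each shop_ids field should come back as one intact string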
Import multiple excel files, create a column and get values from excel file's name
I need to load multiple excel files, each one named with its starting date, e.g. "20190114", and append them into one DataFrame. For this, I use the following code:

all_data = pd.DataFrame()
for f in glob.glob('C:\\path\\*.xlsx'):
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True)

In fact, I do not need all the data, only rows filtered by multiple columns. I would then like to create an additional column ('from') holding the file name (which is a date) for each respective file. Example: data from the excel file named '20190101', and data from the excel file named '20190115'. The final dataframe must have values in the 'price' column not equal to 0, and values in the 'code' column equal to 'r' (I do not know if it's possible to export this data already filtered, avoiding exporting a huge volume of data?), and then I need to add a 'from' column with the respective date coming from the file's name.

Dataframes for trial:

import pandas as pd

df1 = pd.DataFrame({'id': ['id_1', 'id_2', 'id_3', 'id_4', 'id_5'],
                    'price': [0, 12.5, 17.5, 24.5, 7.5],
                    'code': ['r', 'r', 'r', 'c', 'r']})
df2 = pd.DataFrame({'id': ['id_1', 'id_2', 'id_3', 'id_4', 'id_5'],
                    'price': [7.5, 24.5, 0, 149.5, 7.5],
                    'code': ['r', 'r', 'r', 'c', 'r']})
IIUC, you can filter the necessary rows, then concat. For the file name you can use os.path.split() and access the name with string slicing:

import os
import glob
import pandas as pd

l = []
for f in glob.glob('C:\\path\\*.xlsx'):
    df = pd.read_excel(f)
    # os.path.split(f)[1] is the file name; [:-5] strips the '.xlsx'
    df['from'] = os.path.split(f)[1][:-5]
    # keep only rows with code 'r' and a non-zero price
    l.append(df[df['code'].eq('r') & df['price'].ne(0)])
pd.concat(l, ignore_index=True)

     id  price code      from
0  id_2   12.5    r  20190101
1  id_3   17.5    r  20190101
2  id_5    7.5    r  20190101
3  id_1    7.5    r  20190115
4  id_2   24.5    r  20190115
5  id_5    7.5    r  20190115
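Because the filtering happens before the append, only the small filtered frames are ever held in memory, which addresses the "huge volume of data" concern. As an alternative sketch for the name handling, pathlib's Path.stem gives the file name without its extension and is less brittle than slicing off a fixed number of characters (this assumes the files really all end in '.xlsx'; f is the loop variable above):

from pathlib import Path

df['from'] = Path(f).stem  # '20190101.xlsx' -> '20190101'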
Finding first and last rows in Pandas Dataframes for individual files
I have a Pandas Dataframe consisting of multiple .fits files, each one containing multiple columns with individual labels. I'd like to extract one column and create variables that contain the first and last rows of that column, but I'm having a hard time accomplishing that for the individual .fits files rather than just the entire Dataframe. Any help would be appreciated! :)

Here is how I read in my files:

path = '/Users/myname/folder/'
m = [os.path.join(dirpath, f)
     for dirpath, dirnames, files in os.walk(path)
     for f in fnmatch.filter(files, '*.fits')]

^^^ This recursively searches through my directory containing multiple .fits files in many subfolders.

dataframes = []
for ii in range(0, len(m)):
    data = pd.read_csv(m[ii], header='infer', delimiter='\t')
    d = pd.DataFrame(data)

top = d['desired_column'].head()
bottom = d['desired_column'].tail()
First_and_Last = pd.concat([top, bottom])

I tried using the .head and .tail commands for Pandas Dataframes but I am unsure how to properly use them for what I want. With how I read in my .fits files, the code above gives me the very first few rows and the very last few rows (5 of each, the default for head and tail), as seen here:

0       2.456849e+06
1       2.456849e+06
2       2.456849e+06
3       2.456849e+06
4       2.456849e+06
1118    2.456852e+06
1119    2.456852e+06
1120    2.456852e+06
1121    2.456852e+06
1122    2.456852e+06

What I want is the first and last row of the specific column for each .fits file, not just for the Dataframe containing all the .fits files. With the way I am reading in my .fits files, the Dataframe seems to sort of concatenate all the files together. Any tips on how I can accomplish this goal?
If you want only the first row:

top = d['desired_column'].head(1)

If you want only the last row:

bottom = d['desired_column'].tail(1)

I don't see the problem of "the Dataframe seems to sort of concatenate all the files together"; would you please clarify the question? By the way, after data = pd.read_csv(m[ii], header='infer', delimiter='\t'), data is already a DataFrame, so d = pd.DataFrame(data) is unnecessary.
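Building on that, a sketch of collecting the first and last value of the column per file; the column name 'desired_column' comes from the question, and the loop iterates over the asker's file list m:

import pandas as pd

rows = []
for filename in m:
    d = pd.read_csv(filename, header='infer', delimiter='\t')
    col = d['desired_column']
    rows.append({'file': filename,
                 'first': col.iloc[0],
                 'last': col.iloc[-1]})

summary = pd.DataFrame(rows)  # one row per .fits file
print(summary)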
The .iloc function should easily pull the top and bottom row, where df["col_1"] below represents the column of interest:

In [28]: import pandas as pd

In [29]: import numpy as np

In [30]: np.random.seed(42)

In [31]: df = pd.DataFrame(np.random.randn(6, 3), columns=["col_1", "col_2", "col_3"])

In [32]: df
Out[32]:
      col_1     col_2     col_3
0  0.496714 -0.138264  0.647689
1  1.523030 -0.234153 -0.234137
2  1.579213  0.767435 -0.469474
3  0.542560 -0.463418 -0.465730
4  0.241962 -1.913280 -1.724918
5 -0.562288 -1.012831  0.314247

In [33]: pd.Series([df["col_1"].iloc[0], df["col_1"].iloc[-1]])  # or pd.DataFrame([top, bottom]), if a data frame is needed
Out[33]:
0    0.496714
1   -0.562288
dtype: float64
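For what it's worth, .iloc also accepts a list of positions, so the same pair can be pulled in a single indexing call; a small sketch against the df above:

# first and last value of col_1 in one call
df["col_1"].iloc[[0, -1]]
# 0    0.496714
# 5   -0.562288
# Name: col_1, dtype: float64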