Concatenation of multiple .csv files not giving required result in Python

I have a total of 24 .csv files, each with 3 columns and a number of rows, containing the data that I need to read (15677 rows in total, split across the 24 files).
I would like to access and read these data files in chronological order.
At first I tried to concatenate the files, but for some reason I obtain a DataFrame of [15653 rows x 72 columns] when it should be [15677 rows x 3 columns] (all the .csv files have 3 columns, and their row counts sum to 15677).
Here is what I have done so far, which produced the result I mentioned:
import glob
import os
import pandas as pd
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "media/BIWI/*.csv"))))
print(df)
Files being used: https://drive.google.com/drive/folders/19z-OcHRXmTO8VX-Bj8NuOJGJROURLJwt?usp=sharing

When you use pd.read_csv() without providing column names, the first row of each csv file is used as the column header. This way you lose 24 rows, which leaves 15677 - 24 = 15653 rows. And since the resulting DataFrames end up with differently named columns, pd.concat() produces a DataFrame that contains all of these columns, padded with NaN values; that accounts for the 72 = 3 * 24 columns. To fix this, call pd.read_csv() with the names argument set to a list of column names, or with header=None to indicate that the csv files have no header row.
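For example, a minimal sketch of the fix (the column names here are placeholders; substitute the real ones, and note that sorted() is added so the files are concatenated in a stable, chronological order):
import glob
import pandas as pd
cols = ['col1', 'col2', 'col3']  # placeholder names for the 3 columns
files = sorted(glob.glob('media/BIWI/*.csv'))  # sorted so the files concatenate in order
df = pd.concat((pd.read_csv(f, names=cols) for f in files), ignore_index=True)
print(df.shape)  # expected: (15677, 3)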

Related

Renaming some columns of a df using rows from another df (python; pandas)

I have two .csv files that I am reading as DataFrames in Python with the pandas library.
The first file has just one column with 15 rows containing title names. The second file has 20 columns; I want to rename the first 5 with names of my own and the last 15 using the 15 rows of the first file.
I have already read the first file as df1. Please tell me how I can read the second file as df2 while renaming the columns (I am using the names= argument to rename the first 5 columns, but I do not know how to incorporate the last 15 column names from df1).
What you can do here is build the full list of column names by combining your own names with the ones stored in df1:
Cols = ['name1','name2',...,'name5'] + df1['column_name'].tolist()
Here column_name is the column of df1 that contains the header names you want.
Then read the second file with that list:
df2 = pd.read_csv('file_path', names=Cols, header=None)  # use header=0 instead if the file has its own header row to replace
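A fuller sketch of the same idea; every file and column name here is hypothetical:
import pandas as pd
df1 = pd.read_csv('first_file.csv', header=None, names=['titles'])  # one column, 15 rows of header names
Cols = ['c1', 'c2', 'c3', 'c4', 'c5'] + df1['titles'].tolist()  # 5 custom names + 15 from df1
df2 = pd.read_csv('second_file.csv', header=0, names=Cols)  # header=0 replaces the file's own header row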

How do I read only specific columns from a JSON dataframe?

I have a JSON file that I read into a dataframe with 12 columns; however, I only want to read columns 2 and 5, which are named "name" and "score".
Currently, the code I have is:
df = pd.read_json("path",orient='columns', lines=True)
print(df.head())
What that does is displays every column, as would be expected.
After reading through the documentation here:
https://pandas.pydata.org/docs/reference/api/pandas.read_json.html
I can't find any real way to parse only certain columns from JSON, compared to csv where you can select columns using usecols=[]
Pass a list of column names to index the DataFrame after reading:
df[["name","score"]]

Pandas: How to read contents of a CSV into a single column?

I want to read a file 'tos_year.csv' into a Pandas dataframe, such that all values are in one single column. I will later use pd.concat() to add this column to an existing dataframe.
The CSV file holds 80 entries in the form of years, i.e. "... 1966,1966,1966,1966,1967,1967,... "
What I can't figure out is how to read the values into one column with 80 rows, instead of 80 columns with one row.
This is probably quite basic but I'm new to this. Here's my code:
import pandas as pd
tos_year = pd.read_csv('tos_year.csv').T
tos_year.reset_index(inplace=True)
tos_year.columns = ['Year']
As you can see, I tried reading it in and then transposing the dataframe. But on the initial read the year numbers are interpreted as column names, and since columns apparently cannot share identical names, I end up with a dataframe that holds str values like
...
1966
1966.1
1966.2
1966.3
1967
1967.1
...
which is not what I want. So clearly, it's preferable to read it in correctly from the start.
Thanks for any advice!
Add header=None to avoid parsing the years as column names, then transpose and rename the column, e.g. with DataFrame.set_axis:
tos_year = pd.read_csv('tos_year.csv', header=None).T.set_axis(['Year'], axis=1)
Or:
tos_year = pd.read_csv('tos_year.csv', header=None).T
tos_year.columns = ['Year']

Read data from .dat file as rows and columns, ignore comments with loop

I have a .plt file with 110 datasets, each with 11 rows and 9 columns. The datasets are separated by 3 rows of comments each time. I want to read the file as rows and columns. Pandas reads the rows but doesn't recognize the columns, possibly because the first three lines of the file are comments: one starting with '#' and the other two with '$'. How do I make pandas parse this like a csv file?
I have tried read_csv and read_fwf with delimiters and comment characters, and skipped the first three rows, but everything is still read as a single column with index 0.
I ended up ignoring those lines by reading the whole file and slicing out the rows I needed:
test = pd.read_table(filename, header=None)
list_a = test.iloc[start:end]  # start/end delimit one dataset's rows
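Alternatively, a sketch that strips every comment line up front and then lets pandas split the columns. It assumes the values are whitespace-separated (adjust sep if not) and uses a hypothetical filename:
import io
import pandas as pd
with open('data.plt') as f:  # hypothetical filename
    data_lines = [ln for ln in f if not ln.lstrip().startswith(('#', '$'))]
df = pd.read_csv(io.StringIO(''.join(data_lines)), sep=r'\s+', header=None)
print(df.shape)  # expected: (110 * 11, 9) = (1210, 9)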

Most efficient way to compare two near identical CSV's in Python?

I have two CSVs, each with about 1M lines and the same n columns. I want the most efficient way to compare the two files and find where any differences lie. I would prefer to parse this data with Python rather than use any Excel-related tools.
Are you using pandas?
import pandas as pd
df = pd.concat([pd.read_csv('file1.csv'), pd.read_csv('file2.csv')], ignore_index=True)  # df.append() was removed in pandas 2.0
# boolean Series marking each row that repeats an earlier one
df.duplicated()
# rows that appear in both files (the second occurrence of each)
df[df.duplicated()]
# rows that appear in only one file (keep=False flags every copy of a repeated row)
df[~df.duplicated(keep=False)]
# number of duplicate rows present
df.duplicated().sum()
An efficient non-pandas way is to read each line from the first file (the one with fewer lines) into an object like a set or dictionary, which gives O(1) average-time lookups.
Then read lines from the second file and check whether each one exists in the set.
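A minimal sketch of that approach (the file names are placeholders):
with open('file1.csv') as f:
    seen = set(f)  # each raw line, newline included, becomes a set member
with open('file2.csv') as f:
    diff = [line for line in f if line not in seen]
print(len(diff), 'lines of file2.csv are not in file1.csv')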
