writing a header with 125,000+ columns and two rows

writing a header with 125,000+ columns and two rows - python

Excel limits the columns of any csv file around 3000. I am trying to write 125,000 columns in the following format:
O1
MA1
MI1
C1
V1
...
O125000
MA125000
MI125000
C125000
V125000
import pandas as pd
def formatting(i):
return tuple(map(lambda x: x+str(i), ("O", "MA", "MI", "C", "V")))
l = []
for i in range(1, 125001):
l.extend(formatting(i))
f = pd.read_csv('file.csv')
f.columns = l
f.to_csv('new_file.csv')
I tried coding this script but its too slow and inconsistent in the fact that its prone to errors. However, you can get the idea of what I am trying to do from this script.
The current script I use to generate a csv(that contains 2 rows and 125,000+ columns) is the following:
import pandas as pd
import glob
allfiles = glob.glob('*.csv')
index = 0
def testing(file):
#file = file.loc[:,'Open':'Volume']
file = file.values.reshape(1, -1)
return file
for _fileT in allfiles:
nFile = pd.read_csv(_fileT, header=0, usecols=range(1,6))
fFile = testing(nFile)
df = pd.DataFrame(fFile)
new_df = df.iloc[:125279]
new_df = new_df.shift(1, axis=1)
new_df.to_csv('HeadCSV/FinalCSV.csv', mode='a', index=False, header=0)
This script reads two csv files in the directory, and aggregates them into one file however how can I make sure that it prints the header mentioned above and labels the two rows it prints out?
Id basically like to combine these two scripts in the most logical way possible.
the idea is to write the header, then get all the data from the files into the dataframe, then do the row indexing as mentioned, and finally throw it all into a CSV

Related

Combining CSVs with Python issue

I'm trying to combine a bunch of CSVs in a folder into one using Python. Each CSV has 9 columns but no headers. When they combine, some 'sheets' are spread far to the right in the sheet. So it seems they are not combining properly.
Please see code below
## Merge Multiple 1M Rows CSV files
import os
import pandas as pd
# 1. defines path to csv files
path = "C://halfordsCSV//new//Archive1/"
# 2. creates list with files to merge based on name convention
file_list = [path + f for f in os.listdir(path) if f.startswith('greyville_po-')]
# 3. creates empty list to include the content of each file converted to pandas DF
csv_list = []
# 4. reads each (sorted) file in file_list, converts it to pandas DF and appends it to the
csv_list
for file in sorted(file_list):
csv_list.append(pd.read_csv(file).assign(File_Name = os.path.basename(file)))
# 5. merges single pandas DFs into a single DF, index is refreshed
csv_merged = pd.concat(csv_list, ignore_index=True)
# 6. Single DF is saved to the path in CSV format, without index column
csv_merged.to_csv(path + 'halfordsOrders.csv', index=False)
It should be sticking to the same number of columns. Any idea what might be going wrong?

First, please check if separator and delimiter are fine in pandas.read_csv, default are ',' and None. You can pass them like that for example:
pandas.read_csv("my_file_path", sep=';', delimiter=',')
If they are already ok regarding to your csv files, try cleaning the dataframes before concating them
replace :
for file in sorted(file_list):
csv_list.append(pd.read_csv(file).assign(File_Name = os.path.basename(file)))
by :
nan_value = float("NaN")
for file in sorted(file_list):
my_df = pd.read_csv(file)
my_df.assign(File_Name = os.path.basename(file))
my_df.replace("", nan_value, inplace=True)
my_df.dropna(how='all', axis=1, inplace=True)
csv_list.append(my_df)

Read selected data from multiple files

I have 200 .txt files and need to extract one row data from each file and create a different dataframe.
For example (abc1.txt,abc2.txt, .etc) set of files and i need to extract 5th row data from each file and create a dataframe. When reading files, columns need to be separated by '/t' sign.
like this
data = pd.read_csv('abc1.txt', sep="\t", header=None)
I can not figure out how to do all this with a loop. Can you help?

Here is my answer:
import pandas as pd
from pathlib import Path
path = Path('path/to/dir')
files = path.glob('*.txt')
to_concat = []
for f in files:
df = pd.read_csv(f, sep="\t", header=None, nrows=5).loc[4:4]
to_concat.append(df)
result = pd.concat(to_concat)
I have used nrows to read only first 5 rows and then .loc[4:4] to get dataframe rather than series (when you use .loc[4].

Here you go:
import os
import pandas as pd
directory = 'C:\\Users\\PC\\Desktop\\datafiles\\'
aggregate = pd.DataFrame()
for filename in os.listdir(directory):
if filename.endswith(".txt"):
data = pd.read_csv(directory+filename, sep="\t", header=None)
row5 = pd.DataFrame(data.iloc[4]).transpose()
aggregate = aggregate.append(row5)

Append only unlike data from one .csv to another .csv

I have managed to use Python with the speedtest-cli package to run a speedtest of my Internet speed. I run this every 15 min and append the results to a .csv file I call "speedtest.csv". I then have this .csv file emailed to me every 12 hours, which is a lot of data.
I am only interested in keeping the rows of data that return less than 13mbps Download speed. Using the following code, I am able to filter for this data and append it to a second .csv file I call speedtestfilteronly.csv.
import pandas as pd
df = pd.read_csv('c:\speedtest.csv', header=0)
df = df[df['Download'].map(lambda x: x < 13000000.0,)]
df.to_csv('c:\speedtestfilteronly.csv', mode='a', header=False)
The problem now is it appends all the rows that match my filter criteria every time I run this code. So if I run this code 4 times, I receive the same 4 sets of appended data in the "speedtestfilteronly.csv" file.
I am looking to only append unlike rows from speedtest.csv to speedtestfilteronly.csv.
How can I achieve this?
I have got the following code to work, except the only thing it is not doing is filtering the results to < 13000000.0 mb/s: Any other ideas?
import pandas as pd
df = pd.read_csv('c:\speedtest.csv', header=0)
df = df[df['Download'].map(lambda x: x < 13000000.0,)]
history_df = pd.read_csv('c:\speedtest.csv')
master_df = pd.concat([history_df, df], axis=0)
new_master_df = master_df.drop_duplicates(keep="first")
new_master_df.to_csv('c:\emailspeedtest.csv', header=None, index=False)

There's a few different way you could approach this, one would be to read in your filtered dataset, append the new one in memory and then drop duplicates like this:
import pandas as pd
df = pd.read_csv('c:\speedtest.csv', header=0)
df = df[df['Download'].map(lambda x: x < 13000000.0,)]
history_df = pd.read_csv('c:\speedtestfilteronly.csv', header=None)
master_df = pd.concat([history_df, df], axis=0)
new_master_df = master_df.drop_duplicates(keep="first")
new_master_df.to_csv('c:\speedtestfilteronly.csv', header=None, index=False)

Why is the cdc_list getting updated after calling the function read_csv() in total_list?

# Program to combine data from 2 csv file
The cdc_list gets updated after second call of read_csv
overall_list = []
def read_csv(filename):
file_read = open(filename,"r").read()
file_split = file_read.split("\n")
string_list = file_split[1:len(file_split)]
#final_list = []
for item in string_list:
int_fields = []
string_fields = item.split(",")
string_fields = [int(x) for x in string_fields]
int_fields.append(string_fields)
#final_list.append()
overall_list.append(int_fields)
return(overall_list)
cdc_list = read_csv("US_births_1994-2003_CDC_NCHS.csv")
print(len(cdc_list)) #3652
total_list = read_csv("US_births_2000-2014_SSA.csv")
print(len(total_list)) #9131
print(len(cdc_list)) #9131

I don't think the code you pasted explains the issue you've had, at least it's not anywhere I can determine. Seems like there's a lot of code you did not include in what you pasted above, that might be responsible.
However, if all you want to do is merge two csvs (assuming they both have the same columns), you can use Pandas' read_csv and Pandas DataFrame methods append and to_csv, to achieve this with 3 lines of code (not including imports):
import pandas as pd
# Read CSV file into a Pandas DataFrame object
df = pd.read_csv("first.csv")
# Read and append the 2nd CSV file to the same DataFrame object
df = df.append( pd.read_csv("second.csv") )
# Write merged DataFrame object (with both CSV's data) to file
df.to_csv("merged.csv")

Pandas: import multiple csv files into dataframe using a loop and hierarchical indexing

I would like to read multiple CSV files (with a different number of columns) from a target directory into a single Python Pandas DataFrame to efficiently search and extract data.
Example file:
Events
1,0.32,0.20,0.67
2,0.94,0.19,0.14,0.21,0.94
3,0.32,0.20,0.64,0.32
4,0.87,0.13,0.61,0.54,0.25,0.43
5,0.62,0.21,0.77,0.44,0.16
Here is what I have so far:
# get a list of all csv files in target directory
my_dir = "C:\\Data\\"
filelist = []
os.chdir( my_dir )
for files in glob.glob( "*.csv" ) :
filelist.append(files)
# read each csv file into single dataframe and add a filename reference column
# (i.e. file1, file2, file 3) for each file read
df = pd.DataFrame()
columns = range(1,100)
for c, f in enumerate(filelist) :
key = "file%i" % c
frame = pd.read_csv( (my_dir + f), skiprows = 1, index_col=0, names=columns )
frame['key'] = key
df = df.append(frame,ignore_index=True)
(the indexing isn't working properly)
Essentially, the script below is exactly what I want (tried and tested) but needs to be looped through 10 or more csv files:
df1 = pd.DataFrame()
df2 = pd.DataFrame()
columns = range(1,100)
df1 = pd.read_csv("C:\\Data\\Currambene_001y09h00m_events.csv",
skiprows = 1, index_col=0, names=columns)
df2 = pd.read_csv("C:\\Data\\Currambene_001y12h00m_events.csv",
skiprows = 1, index_col=0, names=columns)
keys = [('file1'), ('file2')]
df = pd.concat([df1, df2], keys=keys, names=['fileno'])
I have found many related links, however I am still not able to get this to work:
Reading Multiple CSV Files into Python Pandas Dataframe
Merge of multiple data frames of different number of columns into one big data frame
Import multiple csv files into pandas and concatenate into one DataFrame

You need to decide in what axis you want to append your files. Pandas will always try to do the right thing by:
Assuming that each column from each file is different, and appending digits to columns with similar names across files if necessary, so that they don't get mixed;
Items that belong to the same row index across files are placed side by side, under their respective columns.
The trick to appending efficiently is to tip the files sideways, so you get the desired behaviour to match what pandas.concat will be doing. This is my recipe:
from pandas import *
files = !ls *.csv # IPython magic
d = concat([read_csv(f, index_col=0, header=None, axis=1) for f in files], keys=files)
Notice that read_csv is transposed with axis=1, so it will be concatenated on the column axis, preserving its names. If you need, you can transpose the resulting DataFrame back with d.T.
EDIT:
For different number of columns in each source file, you'll need to supply a header. I understand you don't have a header in your source files, so let's create one with a simple function:
def reader(f):
d = read_csv(f, index_col=0, header=None, axis=1)
d.columns = range(d.shape[1])
return d
df = concat([reader(f) for f in files], keys=files)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

writing a header with 125,000+ columns and two rows - python

Related

Combining CSVs with Python issue

Read selected data from multiple files

Append only unlike data from one .csv to another .csv

Why is the cdc_list getting updated after calling the function read_csv() in total_list?

Pandas: import multiple csv files into dataframe using a loop and hierarchical indexing

Categories

Resources