Dataframes not being concatenated properly - python

I am trying to concatenate multiple dataframes and drop duplicate rows. The code that I am running is the following:
import pandas as pd
import os
import io
import magic
def extract_df_columns(annotated_tumor_only_variants_file):
    if magic.from_file(annotated_tumor_only_variants_file) == 'Microsoft Excel 2007+':
        df2 = pd.read_excel(annotated_tumor_only_variants_file)[['Gene.refGene', 'Start', 'End', 'Ref', 'Alt', 'Func.refGene']]
    else:
        with open(annotated_tumor_only_variants_file, 'r') as f:
            lines = [l for l in f]
        df2 = pd.read_csv(io.StringIO(''.join(lines)), sep='\t')[['Gene.refGene', 'Start', 'End', 'Ref', 'Alt', 'Func.refGene']]
    return df2

def unique_variants_whole_genome(folder_with_annotated_tumor_only_variants_files):
    dflist = [extract_df_columns(folder_with_annotated_tumor_only_variants_files + '/' + x)
              for x in os.listdir(folder_with_annotated_tumor_only_variants_files)]
    df = pd.concat(dflist).drop_duplicates(keep=False, ignore_index=True)
    return df
Because I have a lot of dataframes, I run the above code on 10 dataframes at a time. After that, the resulting dataframes were concatenated with the following code:
dflist = [df1_10, df11_20, df21_30, df31_40]
df1_40 = pd.concat(dflist).drop_duplicates(keep=False, ignore_index=True)
When I tried to concat all 40 dataframes together in one go, I got a different result from the one that the process described above gave me. Do you have any idea why this happened? If you could help me, I would be more than thankful!
Thanks, Eleni
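A minimal sketch of why the two approaches can disagree: drop_duplicates(keep=False) removes every copy of a duplicated row, so a row that is duplicated inside one batch disappears from that batch before it can cancel out further copies sitting in other batches. The toy frames below are illustrative only, not taken from the actual variant files.

import pandas as pd

batch_a = pd.DataFrame({'Gene.refGene': ['TP53', 'TP53', 'KRAS']})  # 'TP53' appears twice
batch_b = pd.DataFrame({'Gene.refGene': ['TP53']})                  # 'TP53' appears once

# Batch-wise: deduplicate each batch first, then concat and deduplicate again
a = batch_a.drop_duplicates(keep=False)   # both 'TP53' rows dropped, 'KRAS' kept
b = batch_b.drop_duplicates(keep=False)   # 'TP53' kept (unique within its batch)
print(pd.concat([a, b]).drop_duplicates(keep=False, ignore_index=True))  # KRAS and TP53 survive

# All at once: the three 'TP53' copies are seen together and all dropped
print(pd.concat([batch_a, batch_b]).drop_duplicates(keep=False, ignore_index=True))  # only KRAS survives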

Related

How to read list of parquets with partially overlapping set of columns in dask?

Consider this code:
import dask
import dask.dataframe as dd
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [11, 12]})
df1.to_parquet("df1.parquet")
df2 = pd.DataFrame({'A': [3, 4], 'C': [13, 14]})
df2.to_parquet("df2.parquet")

all_files = ["df1.parquet", "df2.parquet"]
full_df = dd.read_parquet(all_files)
# dask.compute(full_df)  # KeyError: "['B'] not in index"

def normalize(df):
    # Add any missing columns and put them in a consistent order
    df_cols = set(df.columns)
    for c in ['A', 'B', 'C']:
        if c not in df_cols:
            df[c] = np.nan
    df = df[sorted(df.columns)]
    return df

normal_df = full_df.map_partitions(normalize)
dask.compute(normal_df)  # Still gives KeyError
I was hoping that after the normalization using map_partitions I wouldn't get a KeyError, but read_parquet probably fails before reaching the map_partitions step.
I could have created the DataFrame from a list of delayed objects which would each read one file and normalize the columns, but I want to avoid using delayed objects for this reason.
The other option, suggested by SultanOrazbayev, is to use dask dataframes like this:
def normal_ddf(path):
    df = dd.read_parquet(path)
    return normalize(df)  # normalize should work with both pandas and dask

full_df = dd.concat([normal_ddf(path) for path in all_files])
The problem with this is that when all_files contains a large number of files (10K), it takes a long time to create the dataframe, since all those dd.read_parquet calls happen sequentially. Although dd.read_parquet doesn't need to load the whole file, it still needs to read some metadata to get the column info. Doing that sequentially on 10k files adds up.
So, what is the proper/efficient way to read a bunch of parquet files that don't all have the same set of columns?
dd.concat should take care of your normalization.
Consider this example:
import pandas as pd
import dask.dataframe as dd
import numpy as np
import string

N = 100_000
all_files = []
for col in string.ascii_uppercase[1:]:
    df = pd.DataFrame({
        "A": np.random.normal(size=N),
        col: (np.random.normal(size=N) ** 2) * 50,
    })
    fname = f"df_{col}.parquet"
    all_files.append(fname)
    df.to_parquet(fname)

full_df = dd.concat([dd.read_parquet(path) for path in all_files]).compute()
And I get this on my task stream dashboard:
Another option, not mentioned in the comments by @Michael Delgado, is to load each parquet into a separate dask dataframe and then stitch them together. Here's the rough pseudocode:
def normal_ddf(path):
    df = dd.read_parquet(path)
    return normalize(df)  # normalize should work with both pandas and dask

full_df = dd.concat([normal_ddf(path) for path in all_files])

DtypeWarning: Columns (1,2,3,4,5, ..., 142)

I have hundreds of .asc files that I want to concat using Python pandas.
Here is my code:
import pandas as pd
import glob
import os
joined_files = os.path.join("*.asc")
joined_list = glob.glob(joined_files)
df = pd.concat(map(pd.read_csv, joined_list), ignore_index=True)
print(df)
Actually, my files contain 43 columns and 8395 rows.
It is showing a DtypeWarning.
How can I solve it?
/home/user/anaconda3/lib/python3.9/site-packages/pandas/core/reshape/concat.py:294: DtypeWarning: Columns (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,136,137,138,139,140,141,142) have mixed types.Specify dtype option on import or set low_memory=False.
Try the below.
Assuming your joined_list holds all the file paths as a list, the code below can be tried on top of it:
df_list = [pd.read_csv(x, dtype=str) for x in joined_list]
df = pd.concat(df_list, ignore_index=True)
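As the warning text itself suggests, two other options are to set low_memory=False or to declare explicit dtypes on import; a sketch (the column names in the dtype mapping are placeholders, not from your files):

# Option 1: disable chunked type inference so each file is typed in one pass
df_list = [pd.read_csv(x, low_memory=False) for x in joined_list]

# Option 2: declare dtypes for columns you know; unlisted columns are still inferred
df_list = [pd.read_csv(x, dtype={'sample_id': str, 'value': float}) for x in joined_list]

df = pd.concat(df_list, ignore_index=True)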

How to save specific columns from different files into one file

I am new to Python and I'm using Python 3.9.6. I have about 48 different files, all with "Cam_Cantera_IDT_output_800K_*.csv" in the name, for example Cam_Cantera_IDT_output_800K_7401.csv and Cam_Cantera_IDT_output_800K_8012.csv. All the files have a time column t followed by many columns named X_oh, X_ch2, X_h20, etc. I want to write code that goes through each of these 48 files, takes only the t and X_ch2 columns, and saves them in a new file. Since the time column is the same for all 48 files, I just want 1 time column t followed by 48 columns of X_ch2, so 49 columns in total.
My first attempt was using append:
# Using .append, this code runs
import pandas as pd
import glob

require_cols = ['t', 'X_ch2']
all_data = pd.DataFrame()
for f in glob.glob("Cam_Cantera_IDT_output_800K_*.csv"):
    df = pd.read_csv(f, usecols=require_cols)
    all_data = all_data.append(df, ignore_index=True)

all_data.to_csv("new_combined_file.csv")
This code ran but it appended each file one underneath the other, so I had only 2 columns, one for t and one for X_ch2, but many rows. I later read that that's what append does, it adds the files one underneath the other.
I then tried using pd.concat to add the columns side by side. The code I used can be seen below. Unfortunately I got the same result as with append: just 2 columns, one for time t and one for X_ch2, and all the files were added underneath each other.
#Attempting using pd.concat, this code runs
import glob
import pandas as pd
file_extension = 'Cam_Cantera_IDT_output_800K_*.csv'
all_filenames = [i for i in glob.glob(f"*{file_extension}")]
require_cols = ['t', 'X_ch2']
combined_csv_data = pd.concat([pd.read_csv(f, usecols = require_cols) for f in all_filenames], axis=0)
print(combined_csv_data)
combined_csv_data.to_csv('combined_csv_data.csv')
My last attempt was using pd.merge; I added on='t' for it to merge on the time column so that I only have one time column. I keep getting the error that I am missing 'right', but when I add it to the line it tells me that right is not defined:
combined_csv_data = pd.merge(right, [pd.read_csv(f, usecols = require_cols) for f in all_filenames], on='t') => gives 'right' is not defined.
I tried using right_on = X_ch2 or right_index=True but nothing seems to work.
The original code I tried, without any 'right' is shown below.
# A merge attempt
#TypeError: merge() missing 1 required positional argument: 'right'
import glob
import pandas as pd
file_extension = 'Cam_Cantera_IDT_output_800K_*.csv'
all_filenames = [i for i in glob.glob(f"*{file_extension}")]
require_cols = ['t', 'X_ch2']
combined_csv_data = pd.merge([pd.read_csv(f, usecols = require_cols) for f in all_filenames], on='t')
print(combined_csv_data)
combined_csv_data.to_csv('combined_csv_data.csv')
Any help would be highly appreciated, thank you.
Use axis=1 in concat, convert the t column to the index, and select column X_ch2 to get a Series:
require_cols = ['t', 'X_ch2']
L = [pd.read_csv(f,usecols = require_cols, index_col=['t'])['X_ch2'] for f in all_filenames]
combined_csv_data = pd.concat(L, axis=1)
If you need distinct column names, rename each Series by its filename to avoid 48 duplicated column names:
import os

L = [pd.read_csv(f,
                 usecols=require_cols,
                 index_col=['t'])['X_ch2'].rename(os.path.basename(f))
     for f in all_filenames]
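A complete sketch putting those pieces together (the glob pattern and output file name are taken from the question; stripping the .csv extension from the column names is an extra choice, not required):

import glob
import os
import pandas as pd

all_filenames = glob.glob('Cam_Cantera_IDT_output_800K_*.csv')
require_cols = ['t', 'X_ch2']

# One Series per file, indexed by t and named after its file to avoid 48 duplicate names
L = [pd.read_csv(f, usecols=require_cols, index_col=['t'])['X_ch2']
         .rename(os.path.splitext(os.path.basename(f))[0])
     for f in all_filenames]

# Align on the shared time index and write t plus 48 X_ch2 columns
pd.concat(L, axis=1).to_csv('combined_csv_data.csv')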
If installing the convtools library is an option, then:
import glob

from convtools import conversion as c
from convtools.contrib.tables import Table

required_cols = ["t", "X_ch2"]

table = None
for number, f in enumerate(glob.glob("Cam_Cantera_IDT_output_800K_*.csv")):
    table_ = (
        Table.from_csv(f, header=True)
        .take(*required_cols)
        .rename({"X_ch2": f"X_ch2__{number}"})
    )
    if table is None:
        table = table_
    else:
        table = table.join(table_, on="t", how="full")

table.into_csv("new_combined_file.csv", include_header=True)

Append only unlike data from one .csv to another .csv

I have managed to use Python with the speedtest-cli package to run a speedtest of my Internet speed. I run this every 15 min and append the results to a .csv file I call "speedtest.csv". I then have this .csv file emailed to me every 12 hours, which is a lot of data.
I am only interested in keeping the rows of data that return less than 13mbps Download speed. Using the following code, I am able to filter for this data and append it to a second .csv file I call speedtestfilteronly.csv.
import pandas as pd
df = pd.read_csv('c:\speedtest.csv', header=0)
df = df[df['Download'].map(lambda x: x < 13000000.0,)]
df.to_csv('c:\speedtestfilteronly.csv', mode='a', header=False)
The problem now is it appends all the rows that match my filter criteria every time I run this code. So if I run this code 4 times, I receive the same 4 sets of appended data in the "speedtestfilteronly.csv" file.
I am looking to only append unlike rows from speedtest.csv to speedtestfilteronly.csv.
How can I achieve this?
I have got the following code to work, except that the one thing it is not doing is filtering the results to Download < 13000000.0. Any other ideas?
import pandas as pd
df = pd.read_csv('c:\speedtest.csv', header=0)
df = df[df['Download'].map(lambda x: x < 13000000.0,)]
history_df = pd.read_csv('c:\speedtest.csv')
master_df = pd.concat([history_df, df], axis=0)
new_master_df = master_df.drop_duplicates(keep="first")
new_master_df.to_csv('c:\emailspeedtest.csv', header=None, index=False)
There are a few different ways you could approach this; one would be to read in your filtered dataset, append the new data in memory, and then drop duplicates, like this:
import pandas as pd
df = pd.read_csv('c:\speedtest.csv', header=0)
df = df[df['Download'].map(lambda x: x < 13000000.0,)]
history_df = pd.read_csv('c:\speedtestfilteronly.csv', header=None)
master_df = pd.concat([history_df, df], axis=0)
new_master_df = master_df.drop_duplicates(keep="first")
new_master_df.to_csv('c:\speedtestfilteronly.csv', header=None, index=False)

writing a header with 125,000+ columns and two rows

Excel limits any sheet to 16,384 columns, far fewer than I need. I am trying to write 125,000 columns in the following format:
O1
MA1
MI1
C1
V1
...
O125000
MA125000
MI125000
C125000
V125000
import pandas as pd

def formatting(i):
    return tuple(map(lambda x: x + str(i), ("O", "MA", "MI", "C", "V")))

l = []
for i in range(1, 125001):
    l.extend(formatting(i))

f = pd.read_csv('file.csv')
f.columns = l
f.to_csv('new_file.csv')
I tried coding this script, but it is too slow and inconsistent, in that it is prone to errors. However, you can get the idea of what I am trying to do from it.
The current script I use to generate a CSV (that contains 2 rows and 125,000+ columns) is the following:
import pandas as pd
import glob

allfiles = glob.glob('*.csv')
index = 0

def testing(file):
    # file = file.loc[:, 'Open':'Volume']
    file = file.values.reshape(1, -1)
    return file

for _fileT in allfiles:
    nFile = pd.read_csv(_fileT, header=0, usecols=range(1, 6))
    fFile = testing(nFile)
    df = pd.DataFrame(fFile)
    new_df = df.iloc[:125279]
    new_df = new_df.shift(1, axis=1)
    new_df.to_csv('HeadCSV/FinalCSV.csv', mode='a', index=False, header=0)
This script reads the two CSV files in the directory and aggregates them into one file; however, how can I make sure that it writes the header mentioned above and labels the two rows it outputs?
I'd basically like to combine these two scripts in the most logical way possible: write the header, then get all the data from the files into the dataframe, then do the row indexing as mentioned, and finally write it all out to a CSV.
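A rough sketch of one way the two scripts could be combined (the glob pattern and the usecols=range(1, 6) selection are taken from the scripts above; the output path is simplified and the row labels are placeholders):

import glob
import pandas as pd

def make_header(n_groups):
    # Build O1, MA1, MI1, C1, V1, ..., On, MAn, MIn, Cn, Vn
    cols = []
    for i in range(1, n_groups + 1):
        cols.extend(prefix + str(i) for prefix in ('O', 'MA', 'MI', 'C', 'V'))
    return cols

rows = []
labels = []
for path in glob.glob('*.csv'):
    # Flatten each file's five data columns into a single long row
    values = pd.read_csv(path, header=0, usecols=range(1, 6)).values.reshape(-1)
    rows.append(values)
    labels.append(path)  # label each output row with its source file

wide = pd.DataFrame(rows, index=labels)
wide.columns = make_header(len(wide.columns) // 5)  # assumes the width is a multiple of 5
wide.to_csv('FinalCSV.csv')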
