Adding a pandas.DataFrame to another one with its own name - python

I have data that I want to retrieve from a couple of text files in a folder. For each file in the folder, I create a pandas.DataFrame to store the data. So far it works correctly, and all the files have the same number of rows.
Now what I want to do is to add each of these dataframes to a 'master' dataframe containing all of them, each labeled with its file name.
I already have the file name.
For example, let's say I have 2 dataframes, each with its own file name; I want to add them to the master dataframe with a header for each of these 2 dataframes giving the name of its file.
What I have tried now is the following:
# T0 data
t0_path = "C:/Users/AlexandreOuimet/Box Sync/Analyse Opto/Crunch/GF data crunch/T0/*.txt"
t0_folder = glob.glob(t0_path)
t0_data = pd.DataFrame()
for file in t0_folder:
    raw_data = parseGFfile(file)
    file_data = pd.DataFrame(raw_data, columns=['wavelength', 'max', 'min'])
    file_name = getFileName(file)
    t0_data.insert(loc=len(t0_data.columns), column=file_name, value=file_data)
Could someone help me with this please?
Thank you :)
Edit:
I think I was not clear enough; this is what I am expecting as an output:
[image: expected output]

You may be looking for the concat function. Here's an example:
import pandas as pd
A = pd.DataFrame({'Col1': [1, 2, 3], 'Col2': [4, 5, 6]})
B = pd.DataFrame({'Col1': [7, 8, 9], 'Col2': [10, 11, 12]})
a_filename = 'a_filename.txt'
b_filename = 'b_filename.txt'
A['filename'] = a_filename
B['filename'] = b_filename
C = pd.concat((A, B), ignore_index=True)
print(C)
Output:
   Col1  Col2        filename
0     1     4  a_filename.txt
1     2     5  a_filename.txt
2     3     6  a_filename.txt
3     7    10  b_filename.txt
4     8    11  b_filename.txt
5     9    12  b_filename.txt
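If you want each file name to appear as a header over its own block of columns instead of as a value column, concat also accepts a keys argument; a small sketch reusing A, B and the file names from above:
# axis=1 puts the frames side by side; keys makes each file name the
# top level of a column MultiIndex, i.e. a per-file header.
C = pd.concat([A, B], axis=1, keys=[a_filename, b_filename])
print(C)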

There are a couple of changes to make here in order to do this in an easy way. I'll list the changes and the reasoning below:
Specify which columns your master DataFrame will have.
Instead of using some function that it seems like you were trying to define, simply create a new column called "file_name" that holds, for every record in a DataFrame, the filepath used to build that DataFrame. That way, when you combine the DataFrames, each record's origin is clear. I commented the line where you can make edits if you want to use string methods to clean up the file names.
At the end, don't use insert. For combining DataFrames with the same columns (a union operation, if you're familiar with SQL or set theory), you can use the append method (but see the note after the code about newer pandas versions).
# T0 data
t0_path = "C:/Users/AlexandreOuimet/Box Sync/Analyse Opto/Crunch/GF data crunch/T0/*.txt"
t0_folder = glob.glob(t0_path)
t0_data = pd.DataFrame(columns=['wavelength', 'max', 'min','file_name'])
for file in t0_folder:
    raw_data = parseGFfile(file)
    file_data = pd.DataFrame(raw_data, columns=['wavelength', 'max', 'min'])
    file_data['file_name'] = file  # You can make edits here
    t0_data = t0_data.append(file_data, ignore_index=True)
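Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current pandas the accumulation needs pd.concat instead; a sketch of the same loop rewritten that way:
frames = []
for file in t0_folder:
    raw_data = parseGFfile(file)
    file_data = pd.DataFrame(raw_data, columns=['wavelength', 'max', 'min'])
    file_data['file_name'] = file  # same spot for filename cleanup edits
    frames.append(file_data)
# A single concat at the end is also faster than appending inside the loop.
t0_data = pd.concat(frames, ignore_index=True)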

Related

How to return columns with similar names in a dataframe using pandas

Say I have a table that looks something like:
+----------------------------------+-------------------------------------+----------------------------------+
| ExperienceModifier|ApplicationId | ExperienceModifier|RatingModifierId | ExperienceModifier|ActionResults |
+----------------------------------+-------------------------------------+----------------------------------+
| | | |
+----------------------------------+-------------------------------------+----------------------------------+
I would like to grab all of the columns that lead with 'ExperienceModifier' and stuff the results of that into its own dataframe. How would I accomplish this with pandas?
You can try pandas.DataFrame.filter
df.filter(like='ExperienceModifier')
If you want to get only the columns that contain ExperienceModifier at the beginning:
df.filter(regex='^ExperienceModifier')
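For reference, a quick sketch on a toy frame (the column names here are just for illustration):
import pandas as pd

df = pd.DataFrame(columns=['ExperienceModifier|ApplicationId',
                           'ExperienceModifier|RatingModifierId',
                           'NotExperienceModifier|ActionResults'])
# Only the columns whose names start with 'ExperienceModifier' survive.
print(df.filter(regex='^ExperienceModifier').columns.tolist())
# ['ExperienceModifier|ApplicationId', 'ExperienceModifier|RatingModifierId']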
Ynjxsjmh's answer will get all columns that contain "ExperienceModifier". If you literally want columns that start with that string, rather than merely contain it, you can do new_df = df[[col for col in df.columns if col[:18] == 'ExperienceModifier']]. If all of the desired columns have | after "ExperienceModifier", you could also do new_df = df[[col for col in df.columns if col.split('|')[0] == 'ExperienceModifier']]. Both of these select from the original dataframe; if you want a completely separate dataframe, you should copy it, like this: new_df = df[[col for col in df.columns if col.split('|')[0] == 'ExperienceModifier']].copy(). You might also want to create a multi-index by splitting the column names on | rather than creating a separate dataframe.
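A minimal sketch of that last idea, splitting each column name on | into a two-level column MultiIndex (this assumes every column name contains exactly one |):
# Turn 'ExperienceModifier|ApplicationId'-style names into
# ('ExperienceModifier', 'ApplicationId') tuples on a column MultiIndex.
df.columns = pd.MultiIndex.from_tuples([tuple(col.split('|')) for col in df.columns])
experience = df['ExperienceModifier']  # selects the whole group at once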
The accepted answer easily does the job, but I still attach my "hand-made" version, which also works:
import pandas as pd
import numpy as np
import re
lst = [[1, 2, 3, 4],[1, 2, 3, 4],[1, 2, 3, 4]]
column_names = [['ExperienceModifier|ApplicationId', 'ExperienceModifier|RatingModifierId', 'ExperienceModifier|ActionResults','OtherName|ActionResults']]
data = pd.DataFrame(lst, columns = column_names)
data
old_and_dropped_dataframes = []
new_dataframes=[]
for i in np.arange(0, len(column_names[0])):
    # split the column name on non-word characters such as '|'
    splits = re.findall(r"[\w']+", column_names[0][i])
    if "ExperienceModifier" in splits:
        new_dataframe = data.iloc[:, [i]]
        new_dataframes.append(new_dataframe)
    else:
        old_and_dropped_dataframe = data.iloc[:, [i]]
        old_and_dropped_dataframes.append(old_and_dropped_dataframe)
ExperienceModifier_dataframe = pd.concat(new_dataframes,axis=1)
ExperienceModifier_dataframe
OtherNames_dataframe = pd.concat(old_and_dropped_dataframes,axis=1)
OtherNames_dataframe
This script creates two new dataframes from the initial dataframe: one containing the columns whose names start with ExperienceModifier, and another containing the columns that do not.

Python pandas: convert a csv file into a wide txt file and put the values that have the same name in the "MA" column in the same row

I want to get a file from the csv file formatted as follows:
CSV file:
Desired output txt file (Header italicized):
MA   Am1  Am2  Am3  Am4
MX1  X    Y    -    -
MX2  9    10   11   12
Any suggestions on how to do this? Thank you!
I need help with writing the Python code to achieve this. I've tried looping through every row, but I'm still struggling to find a way to write it.
You can try this.
Group the rows by the unique MA values (the name column here) and collect each group's values into a list.
Create a new dataframe with it.
Expand each values list into columns and add them to the new dataframe.
Copy the name column from the first dataframe.
Reorder the columns so 'name' comes first.
Code:
import pandas as pd
df = pd.DataFrame([['MX1', 1, 222],['MX1', 2, 222],['MX2', 4, 44],['MX2', 3, 222],['MX2', 5, 222]], columns=['name','values','etc'])
df_new = pd.DataFrame(columns = ['name', 'values'])
for group in df.groupby('name'):
    df_new.loc[-1] = [group[0], group[1]['values'].to_list()]
    df_new.index = df_new.index + 1
    df_new = df_new.sort_index()
df_expanded = pd.DataFrame(df_new['values'].values.tolist()).add_prefix('Am')
df_expanded['name'] = df_new['name']
cols = df_expanded.columns.tolist()
cols = cols[-1:] + cols[:-1]
df_expanded = df_expanded[cols]
print(df_expanded.fillna('-'))
Output:
  name Am0 Am1  Am2
0  MX2   4   3  5.0
1  MX1   1   2    -
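A more compact alternative, for what it's worth, is to number the rows within each group and pivot; a sketch on the same df, using the Am1-style headers from the question:
# Number each row within its 'name' group, then pivot to wide form.
df['col'] = 'Am' + (df.groupby('name').cumcount() + 1).astype(str)
wide = df.pivot(index='name', columns='col', values='values').fillna('-').reset_index()
print(wide)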

Slicing DAT file by Fixed Width Stored in Dict

I am having some trouble (I've been trying this for a long time) and still couldn't find a solution on my own. I have a dat file in this format:
abc900800007.2
And I have a dict that contains the column name as key and the corresponding fixed width for the DAT file as value; my dict goes like mydict = {'col1': 3, 'col2': 8, 'col3': 3}.
What I want to do is to create a df by combining both items, slicing the DAT file by the dict values. The df should be like:
col1  col2      col3
abc   90080000  7.2
Any help would be highly appreciated!
I think a possible (but, depending on the file size, memory-intensive) solution is:
import pandas as pd

data = {'col1': [], 'col2': [], 'col3': []}
for line in open('file.dat'):
    data['col1'].append(line[:mydict['col1']])
    begin = mydict['col1']
    end = begin + mydict['col2']
    data['col2'].append(line[begin:end])
    begin = end
    end = begin + mydict['col3']
    data['col3'].append(line[begin:end])
df = pd.DataFrame(data)  # create the DataFrame
del data  # delete the auxiliary data
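For what it's worth, pandas also ships a reader for fixed-width files, which avoids the manual slicing entirely; a minimal sketch driven by the same mydict:
import pandas as pd

mydict = {'col1': 3, 'col2': 8, 'col3': 3}
# read_fwf slices each line by the given widths and assigns the names.
df = pd.read_fwf('file.dat', widths=list(mydict.values()),
                 names=list(mydict.keys()), header=None)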

Pandas iterate over each row of a column and change its value

I have a pandas dataframe which looks like this:
   Name  Age
0   tom   10
1  nick   15
2  juli   14
I am trying to iterate over each name --> connect to a mysql database --> match the name with a column in the database --> fetch the id for the name --> and replace the name with that id in the above data frame. The desired output is as follows:
   Name  Age
0     1   10
1     2   15
2     4   14
The following is the code that I have tried:
import pandas as pd
import MySQLdb
from sqlalchemy import create_engine
engine = create_engine("mysql+mysqldb://root:Abc#123def#localhost/aivu")
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
print(df)
for index, rows in df.iterrows():
    cquery = "select id from students where studentsName=" + '"' + rows['Name'] + '"'
    sid = pd.read_sql(cquery, con=engine)
    df['Name'] = sid['id'].iloc[0]
print(df[['Name', 'Age']])
The above code prints the following output:
   Name  Age
0     1   10
1     1   15
2     1   14
   Name  Age
0     2   10
1     2   15
2     2   14
   Name  Age
0     4   10
1     4   15
2     4   14
I understand it iterates through the entire table for each matched name and prints it. How do I get each value replaced only once?
Here is a slight rewrite of your code; if you want to do a transformation on a dataframe in general, this is a better way to go about it:
import pandas as pd
import MySQLdb
from sqlalchemy import create_engine
engine = create_engine("mysql+mysqldb://root:Abc#123def#localhost/aivu")
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
def replace_name(name: str) -> int:
    cquery = "select id from students where studentsName='{}'".format(name)
    sid = pd.read_sql(cquery, con=engine)
    return sid['id'].iloc[0]

df['Name'] = df['Name'].apply(replace_name)
This should perform the transformation you're looking for
The problem in your code as written is the line:
df['Name'] = sid['id'].iloc[0]
This sets every value in the Name column to the first id entry in your query result.
To accomplish what you want, you want something like:
df.loc[index, 'Name'] = sid['id'].iloc[0]
This will set the value at index location index in the Name column to the first id entry in your query result.
This will accomplish what you want to do, and you can stop reading here if you're in a hurry. If you're not in a hurry, and you'd like to become wiser, I encourage you to read on.
It is generally a mistake to loop over the rows in a dataframe. It's also generally a mistake to iterate through a list carrying out a single query on each item in the list. Both of these are slow and error-prone.
A more idiomatic (and faster) way of doing this would be to get all the relevant rows from the database in one query, merge them with your current dataframe, and then drop the column you no longer want. Something like the following:
names = df['Name'].tolist()
quoted_names = ','.join(f"'{name}'" for name in names)
query = f"select id, studentsName as Name from students where studentsName in ({quoted_names})"
student_ids = pd.read_sql(query, con=engine)
df_2 = df.merge(student_ids, on='Name', how='left')
df_with_ids = df_2[['id', 'Age']]
One query executed, no loops to worry about. Let the database engine and Pandas do the work for you.
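If you would rather keep the original frame and just translate the Name column in place, a lookup dict built from the same student_ids result works too; a small sketch:
# Map each name to its id; names missing from the query become NaN.
id_map = dict(zip(student_ids['Name'], student_ids['id']))
df['Name'] = df['Name'].map(id_map)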
You can do this kind of operation the following way; please follow the comments and feel free to ask questions:
import pandas as pd
# create frame
x = pd.DataFrame(
    {
        "name": ["A", "B", "C"],
        "age": [1, 2, 3]
    }
)
# create some kind of db
mock_database = {"A": 10, "B": 20, "C": 30}
x["id"] = None  # add empty column
print(x)
# change values in the new column (.loc avoids chained assignment)
for i in range(len(x["name"])):
    x.loc[i, "id"] = mock_database.get(x["name"][i])
print("*" * 100)
print(x)
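As an aside, the loop can be replaced by a single vectorized lookup; a sketch using the same x and mock_database:
# Series.map does the dict lookup for every row at once.
x["id"] = x["name"].map(mock_database)
print(x)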
A good way to do that would be:
import pandas as pd
import MySQLdb
from sqlalchemy import create_engine
engine = create_engine("mysql+mysqldb://root:Abc#123def#localhost/aivu")
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
print(df)
name_ids = []
for student_name in df['Name']:
    cquery = "select id from students where studentsName='{}'".format(student_name)
    sid = pd.read_sql(cquery, con=engine)
    name_ids.append(sid['id'].iloc[0] if not sid.empty else None)
# DEBUGGED WITH name_ids = [1, 2, 3]
df['Name'] = name_ids
print(df)
I checked with an example list of ids and it works; I guess if the query format is correct, this will work.
Performance-wise I could not think of a better solution, since you will have to do a lot of queries (one for each student), but there is probably some way to get all the ids with fewer queries.

Python: Add rows with different column names to dict/dataframe

I want to add data (dictionaries) to a dictionary, where every added dictionary represents a new row. It is an iterative process, and it is not known which column names a newly added dictionary (row) could have. In the end I want a pandas dataframe. Furthermore, I have to write the dataframe to a file every 1500 rows (which is a problem, because after 1500 rows it could of course happen that new data is added which has columns that are not present in the 1500 rows already written to the file).
I need an approach which is very fast (maybe 26 ms per row). My approach is slow, because it has to check every piece of data for new column names, and in the end it has to reread the file to create a new file where all columns have the same length. The data comes from a queue which is processed in another process.
import pandas as pd

def writingData(writingQueue, exportFullName='path', buffer=1500, maxFiles=150000):
    imagesPassed = 0
    with open(exportFullName, 'a') as f:
        columnNamesAllList = []
        columnNamesAllSet = set()
        dfTempAll = pd.DataFrame(index=range(buffer), columns=columnNamesAllList)
        columnNamesUpdated = False
        for data in iter(writingQueue.get, "STOP"):
            print(imagesPassed)
            dfTemp = pd.DataFrame([data], index=[imagesPassed])
            if set(dfTemp).difference(columnNamesAllSet):
                columnNamesAllSet.update(set(dfTemp))
                columnNamesAllList.extend(list(dfTemp))
                columnNamesUpdated = True
            else:
                columnNamesUpdated = False
            if columnNamesUpdated:
                print('Updated')
                dfTempAll = dfTemp.combine_first(dfTempAll)
            else:
                dfTempAll.iloc[imagesPassed - 1] = dfTemp.iloc[0]
            imagesPassed += 1
            if imagesPassed == buffer:
                dfTempAll.dropna(how='all', inplace=True)
                dfTempAll.to_csv(f, sep='\t', header=True)
                dfTempAll = pd.DataFrame(index=range(buffer), columns=columnNamesAllList)
                imagesPassed = 0
Reading it in again:
dfTempAll = pd.DataFrame(index=range(maxFiles), columns=columnNamesAllList)
for number, chunk in enumerate(pd.read_csv(exportFullName, delimiter='\t', chunksize=buffer, low_memory=True, memory_map=True, engine='c')):
    dfTempAll.iloc[number*buffer:(number+1)*buffer] = pd.concat([chunk, columnNamesAllList]).values  # .to_csv(f, sep='\t', header=False)  # , chunksize=buffer
    # dfTempAll = pd.concat([chunk, dfTempAll])
dfTempAll.reset_index(drop=True).to_csv(exportFullName, sep='\t', header=True)
Small example with dataframes
So to make it clear: let's say I have an already existing 4-row dataframe (in the real case it could have 150000 rows, like in the code above), where 2 rows are already filled with data. If I add a new row, it could look like this, with the exception that in the raw input the new data is a dictionary:
df1 = pd.DataFrame(index=range(4),columns=['A','B','D'], data={'A': [1, 2, 'NaN', 'NaN'], 'B': [3, 4,'NaN', 'NaN'],'D': [3, 4,'NaN', 'NaN']})
df2 = pd.DataFrame(index=[2], columns=['A','C','B'], data={'A': [0], 'B': [0], 'C': [0]})
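For reference, the pandas constructor itself aligns dicts with differing keys, so one option (a sketch, not benchmarked against the 26 ms per row budget) is to buffer the incoming dicts in a plain list and build each 1500-row chunk in a single call:
import pandas as pd

# Hypothetical rows pulled from the queue; keys may differ per row.
rows = [{'A': 1, 'B': 3, 'D': 3},
        {'A': 2, 'B': 4, 'D': 4},
        {'A': 0, 'C': 0, 'B': 0}]
chunk = pd.DataFrame(rows)  # union of keys becomes the columns, gaps are NaN
print(chunk)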
