Initializing an empty DataFrame and appending rows - python

Unlike creating an empty dataframe and populating rows later, I have many dataframes that need to be concatenated.
If there were only two data frames, I could do this:
df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df1.append(df2, ignore_index=True)
Imagine I have millions of dataframes that need to be appended/concatenated each time I read a new file into a DataFrame object.
But when I tried to initialize an empty dataframe and then add the new dataframes in a loop:
import os
import pandas as pd

alldf = pd.DataFrame(columns=list('AB'))
for filename in os.listdir(indir):
    df = pd.read_csv(indir + filename, delimiter=' ')
    alldf.append(df, ignore_index=True)
This would return an empty alldf with only the header row, e.g.
alldf = pd.DataFrame(columns=list('AB'))
df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
for df in [df1, df2]:
    alldf.append(df, ignore_index=True)

pd.concat() over a list of dataframes is probably the way to go, especially for clean CSVs. But if you suspect your CSVs are dirty, or that read_csv() could infer mixed types across files, you may want to explicitly create each dataframe in a loop.
You can initialize a dataframe from the first file, and then for each subsequent file start with an empty dataframe based on the first:
df2 = pd.DataFrame(data=None, columns=df1.columns, index=df1.index)
This creates df2 with the structure of dataframe df1 but none of its data. If you want to force data types on the columns, do it on df1 when it is created, before its structure is copied.
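For example, a minimal sketch (the file name, delimiter, and dtypes here are illustrative assumptions):
import pandas as pd

# force dtypes when df1 is created; 'A' and 'B' are assumed column names
df1 = pd.read_csv('first_file.csv', delimiter=' ', dtype={'A': 'int64', 'B': 'float64'})
# copy the structure (columns, and dtypes via astype) but no rows
df2 = pd.DataFrame(data=None, columns=df1.columns).astype(df1.dtypes)
print(df2.dtypes)  # matches df1.dtypes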

From @DSM's comment, this works:
import os
import pandas as pd

dfs = []
for filename in os.listdir(indir):
    df = pd.read_csv(indir + filename, delimiter=' ')
    dfs.append(df)
alldf = pd.concat(dfs)
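If you also want a fresh 0..n-1 index on the result, as with the append(..., ignore_index=True) calls above, pass ignore_index to concat:
alldf = pd.concat(dfs, ignore_index=True)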

Related

Pandas: for matching row indices - update dataframe values with values from other dataframe with a different column size

I'm struggling to update values in a dataframe with values from another dataframe, using the row index as key. The dataframes do not have the same number of columns, so updating can only occur for matching columns. With the code below, I would expect df3 to yield the same result as df4; however, df3 is a None object.
Can anyone point me in the right direction? It doesn't seem very complicated, but I can't seem to get it right.
P.S. In reality the two dataframes are a lot larger than in this example (both in rows and columns).
import pandas as pd
data1 = {'A': [1, 2, 3, 4], 'B': [4, 5, 6, 7], 'C': [7, 8, 9, 10]}
df1 = pd.DataFrame(data1, index=['I_1', 'I_2', 'I_3', 'I_4'])
print(df1)

data2 = {'A': [10, 40], 'B': [40, 70]}
df2 = pd.DataFrame(data2, index=['I_1', 'I_4'])
print(df2)

df3 = df1.update(df2)
print(df3)

data4 = {'A': [10, 2, 3, 40], 'B': [40, 5, 6, 70], 'C': [7, 8, 9, 10]}
df4 = pd.DataFrame(data4, index=['I_1', 'I_2', 'I_3', 'I_4'])
print(df4)
pandas.DataFrame.update returns None; the method directly changes the calling object.
Source: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.update.html
For your example this means two things:
update returns None, hence df3 is None.
df1 is changed when df3 = df1.update(df2) is called; from that point on, df1 looks like df4.
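A minimal sketch verifying both points with throwaway frames:
import pandas as pd

a = pd.DataFrame({'A': [1, 2]})
b = pd.DataFrame({'A': [10]})
ret = a.update(b)
print(ret)  # None -- update has no return value
print(a)    # a itself was modified: column A is now [10, 2]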
To create df3 and leave df1 untouched, you can do this:
import pandas as pd

data1 = {'A': [1, 2, 3, 4], 'B': [4, 5, 6, 7], 'C': [7, 8, 9, 10]}
df1 = pd.DataFrame(data1, index=['I_1', 'I_2', 'I_3', 'I_4'])
print(df1)

data2 = {'A': [10, 40], 'B': [40, 70]}
df2 = pd.DataFrame(data2, index=['I_1', 'I_4'])
print(df2)

# use a deep copy (the default) so df1's data is not affected by the update
df3 = df1.copy()
df3.update(df2)
print(df3)

data4 = {'A': [10, 2, 3, 40], 'B': [40, 5, 6, 70], 'C': [7, 8, 9, 10]}
df4 = pd.DataFrame(data4, index=['I_1', 'I_2', 'I_3', 'I_4'])
print(df4)

Pyspark partitionBy: How do I partition my data and then select columns

I have the following data:
import pandas as pd
d = {'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]}
df = pd.DataFrame(data=d)
I want to partition the data by 'col1', but I don't want the 'col1' variable to be in the final data. Is this possible?
The below code would partition by col1, but how do I ensure 'col1' doesn't appear in the final data?
from pyspark.sql.functions import *
df.write.partitionBy("col1").mode("overwrite").csv("file_path/example.csv", header=True)
Final data would be two files that look like:
d1 = {'col2': [3], 'col3': [5]}
df1 = pd.DataFrame(data=d1)
d2 = {'col2': [4], 'col3': [6]}
df2 = pd.DataFrame(data=d2)
Seems simple, but I can't figure out how to partition the data while leaving the partitioning variable out of the final CSV.
Thanks
Once you partition the data using
df.write.partitionBy("col1").mode("overwrite").csv("file_path/example.csv", header=True)
There will be partitions based on your col1.
Now, while reading the dataframe back, you can specify which columns you want (Spark recovers col1 from the partition directory names, so selecting only col2 and col3 drops it):
df = spark.read.csv('path').select('col2', 'col3')
Below is the code for Spark 2.4.0 using the Scala API:
val df = sqlContext.createDataFrame(sc.parallelize(Seq(Row(1, 3, 5), Row(2, 4, 6))),
  StructType(Seq.range(1, 4).map(f => StructField("col" + f, DataTypes.IntegerType))))
df.write.partitionBy("col1")
  .option("header", true)
  .mode(SaveMode.Overwrite)
  .csv("/<path>/test")
It creates two partitions, as below.
col1=1, with the partition file containing:
col2,col3
3,5
col1=2, with the partition file containing:
col2,col3
4,6
col1 does not appear in the files themselves.
In Python:
import os
import tempfile
from pyspark.sql import Row

df = spark.createDataFrame([Row(col1=1, col2=3, col3=5), Row(col1=2, col2=4, col3=6)])
df.write.partitionBy('col1').mode('overwrite').csv(os.path.join(tempfile.mkdtemp(), 'data'))
api doc - https://spark.apache.org/docs/latest/api/python/pyspark.sql.html
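Note that df in the question is a pandas DataFrame, which has no .write attribute; here is a minimal sketch (assuming an active SparkSession named spark) that converts it first:
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
pdf = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]})
sdf = spark.createDataFrame(pdf)  # pandas -> Spark
# col1 ends up in the directory names (col1=1/, col1=2/), not inside the CSV files
sdf.write.partitionBy('col1').mode('overwrite').csv('file_path/example_csv', header=True)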

Pandas: Getting "TypeError: only integer scalar arrays can be converted to a scalar index" while trying to merge data frames

After renaming a DataFrame's column(s), I get an error when merging on the new column(s):
import pandas as pd
df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'b': [3, 1]})
df1.columns = [['b']]
df1.merge(df2, on='b')
TypeError: only integer scalar arrays can be converted to a scalar index
When renaming columns, assign a flat list (df.columns = ['b']), not a nested list (df.columns = [['b']]):
df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'b': [3, 1]})
df1.columns = ['b']
df1.merge(df2, on='b')
# b
# 0 1
Replacing the code tmp.columns = [['POR', 'POR_PORT']] with tmp.rename(columns={'Locode': 'POR', 'Port Name': 'POR_PORT'}, inplace=True) also works.
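The root cause is easy to verify: assigning a nested list creates a MultiIndex (at least in recent pandas versions), which merge cannot treat as a plain key. A quick sketch:
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})
df1.columns = [['b']]
print(type(df1.columns))  # MultiIndex
df1.columns = ['b']
print(type(df1.columns))  # plain Index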

Intepreting Pandas Column Referencing Syntax

I have a basic background in using R for data wrangling but am new to Python. I came across this code snippet from a tutorial on Coursera.
Can someone please explain what columns={col: 'Gold' + col[4:]}, inplace=True means?
(1) From my understanding, df.rename renames the existing column (in the case of the first line, to Gold), but why is there a need for + col[4:] after it?
(2) Does setting inplace to True mean the resulting df output is assigned to the original df?
import pandas as pd

df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
for col in df.columns:
    if col[:2] == '01':
        df.rename(columns={col: 'Gold' + col[4:]}, inplace=True)
    if col[:2] == '02':
        df.rename(columns={col: 'Silver' + col[4:]}, inplace=True)
    if col[:2] == '03':
        df.rename(columns={col: 'Bronze' + col[4:]}, inplace=True)
    if col[:1] == '№':
        df.rename(columns={col: '#' + col[1:]}, inplace=True)
Thank you in advance.
It means:
# for each column name
for col in df.columns:
    # check whether the first 2 characters are 01
    if col[:2] == '01':
        # replace the column name with 'Gold' plus everything after the 4th character
        df.rename(columns={col: 'Gold' + col[4:]}, inplace=True)
    # similar to the above
    if col[:2] == '02':
        df.rename(columns={col: 'Silver' + col[4:]}, inplace=True)
    # similar to the above
    if col[:2] == '03':
        df.rename(columns={col: 'Bronze' + col[4:]}, inplace=True)
    # check the first character
    if col[:1] == '№':
        # replace the leading '№' with '#'
        df.rename(columns={col: '#' + col[1:]}, inplace=True)
Does declaring the function inplace as True mean to assign the resulting df output to the original dataframe
Yes, you are right. It renames the column names in place.
if col[:2] == '01':
    # replace the column name with 'Gold' plus everything after the 4th character
    df.rename(columns={col: 'Gold' + col[4:]}, inplace=True)
(1) If col is a column name like '01xx1234':
1. col[:2] == '01' is True
2. 'Gold' + col[4:] => 'Gold' + '1234' => 'Gold1234'
3. So '01xx1234' is replaced by 'Gold1234'.
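A quick sketch of the slicing itself, reusing the example column name from above:
col = '01xx1234'
print(col[:2])           # '01'   -- the first two characters
print(col[4:])           # '1234' -- everything from index 4 onward
print('Gold' + col[4:])  # 'Gold1234'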
(2) inplace=True applies the change directly to the dataframe and does not return a result.
If you do not use this option, you have to do this instead:
df = df.rename(columns={col: 'Gold' + col[4:]})
inplace=True means: The columns will be renamed in your original dataframe (df)
Your case (inplace=True):
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df.rename(columns={"A": "a", "B": "c"}, inplace=True)
print(df.columns)
# Index(['a', 'c'], dtype='object')
# df already has the renamed columns, because inplace=True.
If you don't use inplace=True, the rename method generates a new dataframe, like this:
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
new_frame = df.rename(columns={"A": "a", "B": "c"})
print(df.columns)
# Index(['A', 'B'], dtype='object')
# It contains the old column names
print(new_frame.columns)
# Index(['a', 'c'], dtype='object')
# It's a new dataframe and has renamed columns
NOTE: In this case, a better approach is to assign the new dataframe back to the original variable (df):
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df = df.rename(columns={"A": "a", "B": "c"})

Selecting columns from dataframe based on the name of other dataframe

I have 3 dataframes. The first is df:
df = pd.DataFrame({'Name': ['CTA15', 'CTA16', 'AC007', 'AC007', 'AC007'],
                   'AA_ID': [22, 22, 2, 2, 2],
                   'BB_ID': [4, 5, 6, 8, 9],
                   'CC_ID': [2, 2, 3, 3, 3],
                   'DD_RE': [4, 7, 8, 9, 0],
                   'EE_RE': [5, 8, 9, 9, 10]})
and df_ID,
df_ID = pd.DataFrame({'Name': ['CTA15', 'CTA16', 'CFV', 'SAP', 'SOS']})
and the other one is df_RE. Both of these data frames have the column Name, so I need to merge each onto data frame df and then select columns based on the last part of the data frame's name. For example, if the data frame is df_ID, I need all columns ending with "ID", plus "Name", for all rows of df with a matching Name; if the data frame is df_RE, I need all columns ending with "RE", plus "Name", from df. I want to save each result separately.
I know I could do this inside a loop:
for dfs in dataframes:
    ID = [col for col in df.columns if '_ID' in col]
    df_ID = pd.merge(df, df_ID, on='Name')
    df_ID = df_ID[ID]
But here ID has to change whenever a data frame ends with RE, and so on. I have several files with different suffixes, so a better solution would be great.
So in the end, for df_ID I need all the columns ending with ID:
df_ID = pd.DataFrame({'Name': ['CTA15', 'CTA16'],
                      'AA_ID': [22, 22],
                      'BB_ID': [4, 5],
                      'CC_ID': [2, 2]})
Any help would be great
Assuming your columns in df are Name and anything with a suffix such as the examples you have listed (e.g. _ID, _RE), then what you could do is parse through the column names to first extract all unique possible suffixes:
# since the suffixes follow a pattern of `_*`, I can look for the `_` character
suffixes = list(set([col[-3:] for col in df.columns if '_' in col]))
Now, with the list of suffixes, you next want to create a dictionary of your existing dataframes, where the keys in the dictionary are suffixes, and the values are the dataframes with the suffix names (e.g. df_ID, df_RE):
dfs = {}
dfs['_ID'] = df_ID
dfs['_RE'] = df_RE
... # and so forth
Now you can loop through your suffixes list to extract the appropriate columns with each suffix in the list and do the merges and column extractions:
for suffix in suffixes:
    cols = [col for col in df.columns if suffix in col]
    dfs[suffix] = pd.merge(df, dfs[suffix], on='Name')
    dfs[suffix] = dfs[suffix][cols]
Now you have your dictionary of suffixed dataframes. If you want your dataframes as separate variables instead of keeping them in your dictionary, you can now set them back as individual objects:
df_ID = dfs['_ID']
df_RE = dfs['_RE']
... # and so forth
Putting it all together in an example:
import pandas as pd

df = pd.DataFrame({'Name': ['CTA15', 'CTA16', 'AC007', 'AC007', 'AC007'],
                   'AA_ID': [22, 22, 2, 2, 2],
                   'BB_ID': [4, 5, 6, 8, 9],
                   'CC_ID': [2, 2, 3, 3, 3],
                   'DD_RE': [4, 7, 8, 9, 0],
                   'EE_RE': [5, 8, 9, 9, 10]})

# Get unique suffixes
suffixes = list(set([col[-3:] for col in df.columns if '_' in col]))

dfs = {}  # dataframes dictionary
df_ID = pd.DataFrame({'Name': ['CTA15', 'CTA16', 'CFV', 'SAP', 'SOS']})
df_RE = pd.DataFrame({'Name': ['AC007']})
dfs['_ID'] = df_ID
dfs['_RE'] = df_RE

for suffix in suffixes:
    cols = [col for col in df.columns if suffix in col]
    dfs[suffix] = pd.merge(df, dfs[suffix], on='Name')
    dfs[suffix] = dfs[suffix][cols]

df_ID = dfs['_ID']
df_RE = dfs['_RE']
print(df_ID)
print(df_RE)
Result:
AA_ID BB_ID CC_ID
0 22 4 2
1 22 5 2
DD_RE EE_RE
0 8 9
1 9 9
2 0 10
You can first merge df with df_ID and then take the columns that end with ID.
pd.merge(df, df_ID, on='Name')[[e for e in df.columns if e.endswith('ID') or e == 'Name']]
Out[121]:
AA_ID BB_ID CC_ID Name
0 22 4 2 CTA15
1 22 5 2 CTA16
Similarly, this can be done for the df_RE df as well.
pd.merge(df, df_RE, on='Name')[[e for e in df.columns if e.endswith('RE') or e == 'Name']]
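A slightly more compact variant of the same idea is DataFrame.filter, which selects columns by regex (the pattern below is an assumption matching the suffix convention above):
pd.merge(df, df_ID, on='Name').filter(regex=r'_ID$|^Name$')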
