Maintaining column order when adding two dataframes with similar formats - python

I have two dataframes with similar formats. Both have 3 indexes/headers. Most of the headers are the same but df2 has a few additional ones. When I add them up the order of the headers gets mixed up. I would like to maintain the order of df1. Any ideas?
Global = pd.read_excel('Mickey Mouse_Clean2.xlsx',header=[0,1,2,3],index_col=[0,1],sheet_name = 'Global')
Oslav = pd.read_excel('Mickey Mouse_Clean2.xlsx',header=[0,1,2,3],index_col=[0,1],sheet_name = 'Country XYZ')
Oslav = Oslav.replace(to_replace=1,value=10)
Oslav = Oslav.replace(to_replace=-1,value=-2)
df = Global.add(Oslav,fill_value=0)
Example of df Format
HeaderA | Header2 | Header3 |
xxx1|xxx2|xxx3|xxx4||xxx1|xxx2|xxx3|xxx4||xxx1|xxx2|xxx3|xxx4 |
ColX|ColY |ColA|ColB|ColC|ColD||ColD|ColE|ColF|ColG||ColH|ColI|ColJ|ColDK|
1 | ds | 1 | |+1 |-1 | .......................................
2 | dh | ..........................................................
3 | ge | ..........................................................
4 | ew | ..........................................................
5 | er | ..........................................................

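df = df[list(Global.columns) + list(set(Oslav.columns) - set(Global.columns))].copy()
or
df = df[list(Global.columns) + [col for col in Oslav.columns if col not in Global.columns]].copy()
(Global.columns is converted to a list first because adding a list to a pandas Index is not list concatenation. The second option also preserves the order of the Oslav columns, if you care about that; the set difference in the first option does not.)
or
df = df.reindex(columns=list(Global.columns) + list(set(Oslav.columns) - set(Global.columns)))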
If you don't want to keep the columns that are in Oslav, but not in Global, you can do
df = df[Global.columns].copy()
Note that .copy() makes the result an independent dataframe, so modifying it later neither affects the original nor triggers a SettingWithCopyWarning.
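As a minimal sketch of the reordering described above, here is a small self-contained example. The frames global_df and oslav_df and their two-level headers are made up for illustration; they only stand in for the Global and Oslav sheets from the question.

```python
import pandas as pd

# Hypothetical stand-ins for the two sheets, with two header levels each.
cols_global = pd.MultiIndex.from_tuples([("H2", "b"), ("H1", "a")])
cols_oslav = pd.MultiIndex.from_tuples([("H1", "a"), ("H3", "c"), ("H2", "b")])
global_df = pd.DataFrame([[1, 2], [3, 4]], columns=cols_global)
oslav_df = pd.DataFrame([[10, 20, 30], [40, 50, 60]], columns=cols_oslav)

# add() aligns on the union of the columns, which is where the order gets lost.
summed = global_df.add(oslav_df, fill_value=0)

# Columns of the first frame in their original order, then the extra ones
# from the second frame in the order they appear there.
extra = [col for col in oslav_df.columns if col not in global_df.columns]
ordered = list(global_df.columns) + extra
result = summed[ordered].copy()

print(result)
```

Building the column list explicitly keeps the intent visible: the first frame dictates the order, and anything only present in the second frame is appended at the end.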

Related

merging rows and replacing NaN values with pandas

I am trying to merge rows with each other to get one row containing all the values that are present. Currently the df looks like this:
(image: dataframe)
What I want is something like:
| index | scan .. | snel. | kool .. | note .. |
| ----- | ------- | ----- | ------- | ------- |
| 0 | 7,8 | 4,0 | 20.0 | Fiasp, ..|
I can get that output with the code example below, but it just seems really messy.
I tried groupby, agg, sum, and max, but all they do is remove columns, so the result looks like this:
df2.groupby('Tijdstempel apparaat').max().reset_index()
I tried filling each row with the values of the previous rows and then dropping the rows that don't contain every value. But this seems like a long workaround and really messy.
df2 = df2.loc[df['Tijdstempel apparaat'] == '20-01-2023 13:24']
df2 = df2.reset_index()
del df2['index']
df2['Snelwerkende insuline (eenheden)'].fillna(method='pad', inplace=True)
df2['Koolhydraten (gram)'].fillna(method='pad', inplace=True)
df2['Notities'].fillna(method='pad', inplace=True)
df2['Scan Glucose mmol/l'].fillna(method='pad', inplace=True)
print(df2)
# df2.loc[df2[0,'Snelwerkende insuline (eenheden)']] = df2.loc[df2[1, 'Snelwerkende insuline (eenheden)']]
df2.drop([0, 1, 2])
Output:
When I have to do this for the entire data.csv (whenever a timestamp like "20-01-2023 13:24" is found multiple times), I am worried it will be really slow and time consuming.
Sample data resembling yours:
df = pd.DataFrame(data={
"times":["date1","date1","date1","date1","date1"],
"type":[1,2,3,4,5],
"key1":[1,None,None,None,None],
"key2":[None,"2",None,None,None],
"key3":[None,None,3,None,None],
"key4":[None,None,None,"val",None],
"key5":[None,None,None,None,5],
})
Solution:
melt = df.melt(id_vars="times",
value_vars=df.columns[1:],)
melt = melt.dropna()
pivot = melt.pivot_table(values="value", index="times", columns="variable", aggfunc=lambda x: x)
Move the type column to the front:
index = list(pivot.columns).index("type")
pivot = pd.concat([pivot.iloc[:,index:], pivot.iloc[:,:index]], axis=1)
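A shorter alternative worth trying, sketched here under the assumption that each column holds at most one non-null value per timestamp: groupby(...).first() returns the first non-null entry of every column within a group. The column names below are made up for illustration.

```python
import pandas as pd

# Hypothetical data: several sparse rows share the same timestamp.
df = pd.DataFrame({
    "times": ["date1", "date1", "date1", "date2"],
    "key1": [1.0, None, None, 7.0],
    "key2": [None, "2", None, None],
    "key3": [None, None, 3.0, None],
})

# first() skips missing values, so each group collapses into a single row
# holding the first non-null entry of every column.
merged = df.groupby("times", as_index=False).first()

print(merged)
```

This runs once over the whole frame, so there is no need to filter and fill each timestamp separately. If a column can hold several different values for one timestamp, first() silently keeps only the earliest, so a different aggregation would be needed in that case.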

How can I copy values from one dataframe column to another based on the difference between the values

I have two mirrored CSV files generated by two different servers. Both files have the same number of lines and should have exactly the same Unix timestamp column. However, due to some clock issues, some records in one file might differ by a nanosecond from their counterpart in the other CSV file. See the example below; the difference is always 1:
dataframe_A dataframe_B
| | ts_ns | | | ts_ns |
| -------- | ------------------ | | -------- | ------------------ |
| 1 | 1661773636777407794| | 1 | 1661773636777407793|
| 2 | 1661773636786474677| | 2 | 1661773636786474677|
| 3 | 1661773636787956823| | 3 | 1661773636787956823|
| 4 | 1661773636794333099| | 4 | 1661773636794333100|
Since these are huge files with millions of lines, I use pandas and dask to process them, but before processing I need to ensure they have the same timestamp column.
I need to check the difference between column ts_ns in A and B and if there is a difference of 1 or -1 I need to replace the value in B with the corresponding ts_ns value in A so I can finally have the same ts_ns value in both files for corresponding records.
How can I do this in a decent way using pandas/dask?
If you're sure that the timestamps should be identical, why don't you simply use the timestamp column from dataframe A and overwrite the timestamp column in dataframe B with it?
Why even check whether the difference is there or not?
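A minimal sketch of that suggestion, assuming both frames have the same number of rows in the same order. The sanity check on the maximum difference is an addition for illustration, not something the question requires.

```python
import pandas as pd

df_a = pd.DataFrame({"ts_ns": [1661773636777407794, 1661773636786474677,
                               1661773636787956823, 1661773636794333099]})
df_b = pd.DataFrame({"ts_ns": [1661773636777407793, 1661773636786474677,
                               1661773636787956823, 1661773636794333100]})

# Largest absolute difference between corresponding rows. The subtraction
# stays in int64, so no precision is lost on 19-digit values.
max_diff = int((df_a["ts_ns"] - df_b["ts_ns"]).abs().max())
if max_diff > 1:
    raise ValueError(f"timestamps differ by up to {max_diff} ns; rows may be misaligned")

# Overwrite by position; to_numpy() sidesteps index alignment.
df_b["ts_ns"] = df_a["ts_ns"].to_numpy()

print(df_b)
```

The check costs one pass over the column and guards against the two files being out of step, which a blind overwrite would hide.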
You can use the pandas merge_asof function for this; see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html . Its tolerance parameter accepts an int or a Timedelta; for your example it should be set to 1, with direction set to 'nearest'.
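Here is a rough sketch of how the merge_asof call could look. The payload column and the helper column ts_ns_a are hypothetical names introduced for illustration; both frames must already be sorted by ts_ns.

```python
import pandas as pd

df_a = pd.DataFrame({"ts_ns": [1661773636777407794, 1661773636786474677,
                               1661773636787956823, 1661773636794333099]})
df_b = pd.DataFrame({
    "ts_ns": [1661773636777407793, 1661773636786474677,
              1661773636787956823, 1661773636794333100],
    "payload": ["w", "x", "y", "z"],  # hypothetical extra column in file B
})

# Carry A's timestamp along under a second name, because the "on" column of
# the result keeps the left frame's (B's) values.
right = df_a.assign(ts_ns_a=df_a["ts_ns"])

# For every row of B, pick the nearest timestamp in A that is at most 1 ns away.
aligned = pd.merge_asof(df_b, right, on="ts_ns", tolerance=1, direction="nearest")

# Every row found a partner here, so A's value can replace B's directly.
aligned["ts_ns"] = aligned["ts_ns_a"]
aligned = aligned.drop(columns="ts_ns_a")

print(aligned)
```

Unlike a positional overwrite, this matches on the values themselves, so it does not depend on the rows being in the same order. Rows without a partner within the tolerance would get a missing value in ts_ns_a, which turns the column into floats and can lose nanosecond precision, so it is worth checking for missing values before overwriting on real data.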
Assuming your files are identical except from your ts_ns column you can perform a .merge on indices.
import numpy as np
import pandas as pd

df_a = pd.DataFrame({'ts_ns': [1661773636777407794, 1661773636786474677, 1661773636787956823, 1661773636794333099]})
df_b = pd.DataFrame({'ts_ns': [1661773636777407793, 1661773636786474677, 1661773636787956823, 1661773636794333100]})
# Join A's timestamps onto B by row index, then take A's value wherever the two differ by at most 1
df_b = (df_b
    .merge(df_a, how='left', left_index=True, right_index=True, suffixes=('', '_a'))
    .assign(
        ts_ns = lambda df_: np.where(abs(df_.ts_ns - df_.ts_ns_a) <= 1, df_.ts_ns_a, df_.ts_ns)
    )
    .loc[:, ['ts_ns']]
)
But I agree with #ManEngel, just overwrite all the values if you know they are identical.

Use one data-frame (used as a dictionary) to fill in the main data-frame (Python, Pandas)

I have a central DataFrame called "cases" (5000000 rows × 5 columns) and a secondary DataFrame, called "relevant information", which is a kind of dictionary in relation to the central DataFrame (300 rows × 6 columns).
I am trying to fill in the central DataFrame based on a common column called "Verdict_type".
If a value does not appear in the secondary DataFrame, "not_relevant" should be filled in for all the columns that are added.
I used all sorts of directions without success.
I would love to get a good direction.
The DataFrames
import pandas as pd
# this is a mockup of the raw data
cases = [
[1, "1", "v1"],
[2, "2", "v2"],
[3, "3", "v3"]
]
relevant_info = [
["v1", "info1"],
["v3", "info3"]
]
# these are the data from screenshot
df_cases = pd.DataFrame(cases, columns=["id", "verdict_name", "verdict_type"]).set_index("id")
df_relevant_info = pd.DataFrame(relevant_info, columns=["verdict_type", "features"])
Input:
df_cases <-- note here the index marked as 'id'
df_relevant_info
# first, flatten the index of the cases ( this is probably what you were missing )
df_cases = df_cases.reset_index()
# then, merge the two sets on the verdict_type (a left join keeps every row of cases and adds no extra rows)
df_merge = pd.merge(df_cases, df_relevant_info, on="verdict_type", how="left")
# finally, mark missing values as non relevant
df_merge["features"] = df_merge["features"].fillna(value="not_relevant")
Output:
merged set:
+----+------+----------------+----------------+--------------+
| | id | verdict_name | verdict_type | features |
|----+------+----------------+----------------+--------------|
| 0 | 1 | 1 | v1 | info1 |
| 1 | 2 | 2 | v2 | not_relevant |
| 2 | 3 | 3 | v3 | info3 |
+----+------+----------------+----------------+--------------+
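Since the real lookup table has several columns, here is a small sketch of the same idea that fills every added column at once. The column weight is a hypothetical second lookup column added for illustration.

```python
import pandas as pd

df_cases = pd.DataFrame({
    "id": [1, 2, 3],
    "verdict_type": ["v1", "v2", "v3"],
})
df_relevant_info = pd.DataFrame({
    "verdict_type": ["v1", "v3"],
    "features": ["info1", "info3"],
    "weight": ["high", "low"],  # hypothetical extra lookup column
})

# Every column the lookup table contributes, apart from the join key.
added_cols = [c for c in df_relevant_info.columns if c != "verdict_type"]

# A left join keeps all cases in their original order.
merged = df_cases.merge(df_relevant_info, on="verdict_type", how="left")

# Only the added columns are filled, so genuine gaps in the cases stay visible.
merged[added_cols] = merged[added_cols].fillna("not_relevant")

print(merged)
```

Restricting fillna to the added columns avoids overwriting missing values that were already present in the central frame.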

pandas merge header rows if one is not NaN

I'm importing into a dataframe an excel sheet which has its headers split into two rows:
Colour | NaN | Shape | Mass | NaN
NaN | width | NaN | NaN | Torque
green | 33 | round | 2 | 6
etc
I want to collapse the first two rows into one header:
Colour | width | Shape | Mass | Torque
green | 33 | round | 2 | 6
...
I tried merged_header = df.loc[0].combine_first(df.loc[1])
but I'm not sure how to get that back into the original dataframe.
I've tried:
# drop top 2 rows
df = df.drop(df.index[[0,1]])
# then add the merged one in:
res = pd.concat([merged_header, df], axis=0)
But that just inserts merged_header as a column. I tried some other combinations of merge from this tutorial but without luck.
merged_header.append(df) gives a similar wrong result, and res = df.append(merged_header) is almost right, but the header is at the tail end:
green | 33 | round | 2 | 6
...
Colour | width | Shape | Mass | Torque
To provide more detail this is what I have so far:
df = pd.read_excel(ltro19, header=None, skiprows=9)
# delete all empty columns & rows
df = df.dropna(axis = 1, how = 'all')
df = df.dropna(axis = 0, how = 'all')
in case it affects the next step.
Let's use list comprehension to flatten multiindex column header:
df.columns = [f'{j}' if str(i)=='nan' else f'{i}' for i, j in df.columns]
Output:
['Colour', 'width', 'Shape', 'Mass', 'Torque']
This should work for you:
df.columns = list(df.columns.get_level_values(0))
Probably due to my ignorance of the terms, the suggestions above did not lead me directly to a working solution. It seemed I was working with a dataframe
>>> print(type(df))
<class 'pandas.core.frame.DataFrame'>
but, I think, without headers.
This solution worked, although it involved jumping out of the dataframe and into a list to then put it back as the column headers. Inspired by Merging Two Rows (one with a value, the other NaN) in Pandas
df = pd.read_excel(name_of_file, header=None, skiprows=9)
# delete all empty columns & rows
df = df.dropna(axis = 1, how = 'all')
df = df.dropna(axis = 0, how = 'all')
# merge the two headers which are weirdly split over two rows
merged_header = df.loc[0].combine_first(df.loc[1])
# turn that into a list
header_list = merged_header.values.tolist()
# load that list as the new headers for the dataframe
df.columns = header_list
# drop top 2 rows (old split header)
df = df.drop(df.index[[0,1]])
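The same steps can be tried without an Excel file. The sketch below builds the raw frame in memory from the example in the question, so the values are only illustrative.

```python
import numpy as np
import pandas as pd

# What read_excel(..., header=None) would roughly give for the example sheet.
raw = pd.DataFrame([
    ["Colour", np.nan, "Shape", "Mass", np.nan],
    [np.nan, "width", np.nan, np.nan, "Torque"],
    ["green", 33, "round", 2, 6],
])

# Take the first header row and fill its gaps from the second one.
merged_header = raw.iloc[0].combine_first(raw.iloc[1])

# Drop the two header rows, renumber the index, and install the new header.
df = raw.iloc[2:].reset_index(drop=True)
df.columns = merged_header.tolist()

print(df)
```

Using iloc here selects the header rows by position, which keeps working even if earlier dropna calls have removed rows and left gaps in the index labels.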

What's the most efficient way to accumulate dataframes in pyspark?

I have a dataframe (or it could be any RDD) containing several million rows in a well-known schema like this:
Key | FeatureA | FeatureB
--------------------------
U1 | 0 | 1
U2 | 1 | 1
I need to load a dozen other datasets from disk that contain different features for the same number of keys. Some datasets are up to a dozen or so columns wide. Imagine:
Key | FeatureC | FeatureD | FeatureE
-------------------------------------
U1 | 0 | 0 | 1
Key | FeatureF
--------------
U2 | 1
It feels like a fold or an accumulation where I just want to iterate all the datasets and get back something like this:
Key | FeatureA | FeatureB | FeatureC | FeatureD | FeatureE | FeatureF
---------------------------------------------------------------------
U1 | 0 | 1 | 0 | 0 | 1 | 0
U2 | 1 | 1 | 0 | 0 | 0 | 1
I've tried loading each dataframe then joining but that takes forever once I get past a handful of datasets. Am I missing a common pattern or efficient way of accomplishing this task?
Assuming there is at most one row per key in each DataFrame and all keys are of primitive types, you can try a union with an aggregation. Let's start with some imports and example data:
from itertools import chain
from functools import reduce
from pyspark.sql.types import StructType
from pyspark.sql.functions import col, lit, max
from pyspark.sql import DataFrame
df1 = sc.parallelize([
("U1", 0, 1), ("U2", 1, 1)
]).toDF(["Key", "FeatureA", "FeatureB"])
df2 = sc.parallelize([
("U1", 0, 0, 1)
]).toDF(["Key", "FeatureC", "FeatureD", "FeatureE"])
df3 = sc.parallelize([("U2", 1)]).toDF(["Key", "FeatureF"])
dfs = [df1, df2, df3]
Next we can extract common schema:
output_schema = StructType(
[df1.schema.fields[0]] + list(chain(*[df.schema.fields[1:] for df in dfs]))
)
and transform all DataFrames:
transformed_dfs = [df.select(*[
lit(None).cast(c.dataType).alias(c.name) if c.name not in df.columns
else col(c.name)
for c in output_schema.fields
]) for df in dfs]
Finally, a union and a dummy aggregation:
combined = reduce(DataFrame.unionAll, transformed_dfs)
exprs = [max(c).alias(c) for c in combined.columns[1:]]
result = combined.repartition(col("Key")).groupBy(col("Key")).agg(*exprs)
If there is more than one row per key but individual columns are still atomic you can try to replace max with collect_list / collect_set followed by explode.
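To make the logic of the union followed by a max aggregation easier to follow, here is a plain-Python illustration of the same steps on the example data. It is not Spark code and says nothing about performance; filling absent features with 0 is an assumption taken from the desired output in the question.

```python
from collections import defaultdict

# Each dataset: (column names, rows), mirroring df1, df2 and df3 above.
datasets = [
    (["Key", "FeatureA", "FeatureB"], [("U1", 0, 1), ("U2", 1, 1)]),
    (["Key", "FeatureC", "FeatureD", "FeatureE"], [("U1", 0, 0, 1)]),
    (["Key", "FeatureF"], [("U2", 1)]),
]

# Step 1: common schema, i.e. every feature column from every dataset.
features = [name for cols, _ in datasets for name in cols[1:]]

# Step 2: "union": pad each row with None for the features it does not have.
padded = []
for cols, rows in datasets:
    for row in rows:
        record = dict(zip(cols, row))
        padded.append((record["Key"], [record.get(f) for f in features]))

# Step 3: "dummy aggregation": per key, take the max over the non-missing values.
grouped = defaultdict(list)
for key, values in padded:
    grouped[key].append(values)

result = {}
for key, rows in grouped.items():
    merged = []
    for column in zip(*rows):
        present = [v for v in column if v is not None]
        merged.append(max(present) if present else 0)  # assumed default of 0
    result[key] = merged

print(features)
print(result)
```

Because each key has at most one real value per feature, max simply picks that value; it acts as a way of discarding the padding rather than as a true maximum.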
