I am trying to merge rows with each other to get one row containing all the values that are present. Currently the df looks like this:
(screenshot of the dataframe)
What I want is something like:
| index | scan .. | snel. | kool .. | note .. |
| ----- | ------- | ----- | ------- | ------- |
| 0 | 7,8 | 4,0 | 20.0 | Fiasp, ..|
I can get that output with the code example below, but it just seems really messy.
I tried groupby with agg, sum, and max, but all those do is remove columns, and it looks like this:
df2.groupby('Tijdstempel apparaat').max().reset_index()
I tried filling the row with the values of the previous rows and then dropping the rows that don't contain every value, but this seems like a long workaround and really messy.
df2 = df2.loc[df['Tijdstempel apparaat'] == '20-01-2023 13:24']
df2 = df2.reset_index()
del df2['index']
df2['Snelwerkende insuline (eenheden)'].fillna(method='pad', inplace=True)
df2['Koolhydraten (gram)'].fillna(method='pad', inplace=True)
df2['Notities'].fillna(method='pad', inplace=True)
df2['Scan Glucose mmol/l'].fillna(method='pad', inplace=True)
print(df2)
# df2.loc[df2[0,'Snelwerkende insuline (eenheden)']] = df2.loc[df2[1, 'Snelwerkende insuline (eenheden)']]
df2 = df2.drop([0, 1, 2])
Output:
When I have to do this for the entire data.csv (whenever a timestamp like "20-01-2023 13:24" is found multiple times), I am worried it will be really slow and time-consuming.
Sample data, shaped like yours:
import pandas as pd

df = pd.DataFrame(data={
"times":["date1","date1","date1","date1","date1"],
"type":[1,2,3,4,5],
"key1":[1,None,None,None,None],
"key2":[None,"2",None,None,None],
"key3":[None,None,3,None,None],
"key4":[None,None,None,"val",None],
"key5":[None,None,None,None,5],
})
Solution:
# long format: one row per (times, column, value) triple, then drop the NaNs
melt = df.melt(id_vars="times", value_vars=df.columns[1:])
melt = melt.dropna()
# back to wide format: one row per timestamp, keeping the values as they are
pivot = melt.pivot_table(values="value", index="times", columns="variable", aggfunc=lambda x: x)
Move the type column to the front:
index = list(pivot.columns).index("type")
pivot = pd.concat([pivot.iloc[:,index:], pivot.iloc[:,:index]], axis=1)
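As a side note, a shorter sketch for the same collapse is groupby(...).first(), which keeps the first non-null value per column within each group; this assumes one meaningful value per timestamp per column (for the sample frame above, the multi-valued type column would collapse to its first value):
# keep the first non-null value in each column per timestamp
collapsed = df.groupby("times", as_index=False).first()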
In order to plot the frequency of tornadoes every 10 days, I have grouped the data into groups of 10 days using
df_grouped = pd.DataFrame()
df_grouped['COUNT'] = df.groupby(pd.Grouper(key='DATE', freq='10D'))['DATE'].count().to_frame()
However, the DATE column does not exist in the result, as shown when I run:
>>> df_grouped.shape
(1041,1)
despite the fact that I am able to view and plot the dates in the Jupyter notebook GUI.
This is an issue because I want to access this data later for other purposes, and I am unable to do so using:
year = pd.to_datetime(df_grouped['DATE'], dayfirst = True, errors='coerce').dt.year.values
df_grouped['year'] = year
It states that there is an invalid indexing error since the column no longer exists. Does anyone know what I can do to access the data?
MINIMUM REPRODUCIBLE EXAMPLE
import pandas as pd
df = pd.DataFrame(pd.date_range(start='1994-01-01', end='1994-01-21'), columns=['DATE'])
df_grouped = pd.DataFrame()
df_grouped['COUNT'] = df.groupby(pd.Grouper(key='DATE', freq='10D'))['DATE'].count().to_frame()
Expected output:
| DATE | COUNT |
| ---------- | ----- |
| 1994-01-01 | 10 |
| 1994-01-11 | 10 |
| 1994-01-21 | 10 |
| 1994-01-31 | 01 |
Actual output (DATE has become the index rather than a column):
| |COUNT |
|DATE | |
|1994-01-01|10 |
|1994-01-11|10 |
|1994-01-21|10 |
|1994-01-31|01 |
import pandas as pd
df = pd.DataFrame(pd.date_range(start='1994-01-01', end='1994-01-21'), columns=['DATE'])
df = (df.assign(COUNT=lambda x: 1)                      # dummy column to count
        .groupby(pd.Grouper(key='DATE', freq='10D')).count()
        .reset_index())                                 # move DATE from the index back to a column
print(df)
# DATE COUNT
# 0 1994-01-01 10
# 1 1994-01-11 10
# 2 1994-01-21 1
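The DATE column "disappears" in the original attempt because groupby moves the grouping key into the index; reset_index() brings it back as a regular column. An equivalent sketch using .size(), which counts rows per bin without the dummy COUNT column (same sample frame assumed):
import pandas as pd
raw = pd.DataFrame(pd.date_range(start='1994-01-01', end='1994-01-21'), columns=['DATE'])
df_grouped = (raw.groupby(pd.Grouper(key='DATE', freq='10D'))
                 .size()                       # number of rows per 10-day bin
                 .reset_index(name='COUNT'))   # DATE becomes a regular column again
print(df_grouped)
#         DATE  COUNT
# 0 1994-01-01     10
# 1 1994-01-11     10
# 2 1994-01-21      1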
I have two CSV mirror files generated by two different servers. Both files have the same number of lines and should have exactly the same Unix timestamp column. However, due to some clock issues, some records in one file may differ by a nanosecond from their counterpart records in the other CSV file; see the example below, where the difference is always 1:
dataframe_A dataframe_B
| | ts_ns | | | ts_ns |
| -------- | ------------------ | | -------- | ------------------ |
| 1 | 1661773636777407794| | 1 | 1661773636777407793|
| 2 | 1661773636786474677| | 2 | 1661773636786474677|
| 3 | 1661773636787956823| | 3 | 1661773636787956823|
| 4 | 1661773636794333099| | 4 | 1661773636794333100|
Since these are huge files with millions of lines, I use pandas and dask to process them, but before processing I need to ensure they have the same timestamp column.
I need to check the difference between column ts_ns in A and B, and if there is a difference of 1 or -1, replace the value in B with the corresponding ts_ns value from A, so that both files end up with the same ts_ns value for corresponding records.
How can I do this in a decent way using pandas/dask?
If you're sure that the timestamps should be identical, why don't you simply use the timestamp column from dataframe A and overwrite the timestamp column in dataframe B with it?
Why even check whether the difference is there or not?
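In code, that overwrite is a single line (a sketch, assuming df_a and df_b hold the two files and their rows line up one-to-one by position):
df_b['ts_ns'] = df_a['ts_ns'].to_numpy()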
You can use the pandas merge_asof function for this, see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html. The tolerance parameter accepts an int or a timedelta; for your example it should be set to 1, with direction set to nearest.
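A rough sketch of that idea (the ts_ns_a helper column and the one-to-one matching within +/-1 ns are assumptions; rows without a counterpart would come back as NaN, so the index-based merge below may be safer for this data):
import pandas as pd
df_a = pd.DataFrame({'ts_ns': [1661773636777407794, 1661773636786474677, 1661773636787956823, 1661773636794333099]})
df_b = pd.DataFrame({'ts_ns': [1661773636777407793, 1661773636786474677, 1661773636787956823, 1661773636794333100]})
# keep A's timestamp in a second column so it survives the merge
df_a_keyed = df_a.assign(ts_ns_a=df_a['ts_ns'])
# both sides must be sorted on the key for merge_asof
fixed = pd.merge_asof(df_b.sort_values('ts_ns'), df_a_keyed.sort_values('ts_ns'),
                      on='ts_ns', direction='nearest', tolerance=1)
# every row found a counterpart within +/-1 ns here, so ts_ns_a holds A's value throughout
df_b_fixed = fixed[['ts_ns_a']].rename(columns={'ts_ns_a': 'ts_ns'})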
Assuming your files are identical except for the ts_ns column, you can perform a .merge on the indices.
import numpy as np
import pandas as pd

df_a = pd.DataFrame({'ts_ns': [1661773636777407794, 1661773636786474677, 1661773636787956823, 1661773636794333099]})
df_b = pd.DataFrame({'ts_ns': [1661773636777407793, 1661773636786474677, 1661773636787956823, 1661773636794333100]})
df_b = (df_b
.merge(df_a, how='left', left_index=True, right_index=True, suffixes=('', '_a'))
.assign(
ts_ns = lambda df_: np.where(abs(df_.ts_ns - df_.ts_ns_a) <= 1, df_.ts_ns_a, df_.ts_ns)
)
.loc[:, ['ts_ns']]
)
But I agree with @ManEngel: just overwrite all the values if you know they are identical.
I'm importing into a dataframe an Excel sheet which has its headers split into two rows:
Colour | NaN | Shape | Mass | NaN
NaN | width | NaN | NaN | Torque
green | 33 | round | 2 | 6
etc
I want to collapse the first two rows into one header:
Colour | width | Shape | Mass | Torque
green | 33 | round | 2 | 6
...
I tried merged_header = df.loc[0].combine_first(df.loc[1])
but I'm not sure how to get that back into the original dataframe.
I've tried:
# drop top 2 rows
df = df.drop(df.index[[0,1]])
# then add the merged one in:
res = pd.concat([merged_header, df], axis=0)
But that just inserts merged_header as a column. I tried some other combinations of merge from this tutorial but without luck.
merged_header.append(df) gives a similar wrong result, and res = df.append(merged_header) is almost right, but the header is at the tail end:
green | 33 | round | 2 | 6
...
Colour | width | Shape | Mass | Torque
To provide more detail, this is what I have so far:
df = pd.read_excel(ltro19, header=None, skiprows=9)
# delete all empty columns & rows
df = df.dropna(axis = 1, how = 'all')
df = df.dropna(axis = 0, how = 'all')
in case it affects the next step.
Let's use a list comprehension to flatten the MultiIndex column header:
df.columns = [f'{j}' if str(i)=='nan' else f'{i}' for i, j in df.columns]
Output:
['Colour', 'width', 'Shape', 'Mass', 'Torque']
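As a self-contained illustration of what that comprehension does (the columns are built by hand here to mimic the two header rows, rather than read from the asker's file):
import numpy as np
import pandas as pd
# two-level header with NaN where a cell was blank in the sheet
cols = pd.MultiIndex.from_tuples([
    ('Colour', np.nan), (np.nan, 'width'), ('Shape', np.nan),
    ('Mass', np.nan), (np.nan, 'Torque'),
])
df = pd.DataFrame([['green', 33, 'round', 2, 6]], columns=cols)
# take the second header row only where the first is missing
df.columns = [f'{j}' if str(i) == 'nan' else f'{i}' for i, j in df.columns]
print(list(df.columns))  # ['Colour', 'width', 'Shape', 'Mass', 'Torque']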
This should work for you:
df.columns = list(df.columns.get_level_values(0))
Probably due to my ignorance of the terms, the suggestions above did not lead me directly to a working solution. It seemed I was working with a dataframe
>>> print(type(df))
<class 'pandas.core.frame.DataFrame'>
but, I think, without headers.
This solution worked, although it involved jumping out of the dataframe and into a list to then put it back as the column headers. Inspired by Merging Two Rows (one with a value, the other NaN) in Pandas
df = pd.read_excel(name_of_file, header=None, skiprows=9)
# delete all empty columns & rows
df = df.dropna(axis = 1, how = 'all')
df = df.dropna(axis = 0, how = 'all')
# merge the two headers which are weirdly split over two rows
merged_header = df.loc[0].combine_first(df.loc[1])
# turn that into a list
header_list = merged_header.values.tolist()
# load that list as the new headers for the dataframe
df.columns = header_list
# drop top 2 rows (old split header)
df = df.drop(df.index[[0,1]])
I have two csv files named test1.csv and test2.csv and they both have a column named 'Name'. I would like to compare each row in this Name column between both files and output the ones that don't match to a third file. I have seen some examples using pandas, but none worked for my situation. Can anyone help me get a script going for this?
Test2 will be updated to include all values from test1 plus new values not included in test1 (which are the ones I want saved to a third file).
An example of what the columns look like is:
test1.csv:
Name Number Status
gfd454 456 Disposed
3v4fd 521 Disposed
th678iy 678 Disposed
test2.csv
Name Number Status
gfd454 456 Disposed
3v4fd 521 Disposed
th678iy 678 Disposed
vb556h 665 Disposed
See below.
The idea is to read the names into a Python set data structure and find the new names by doing set subtraction.
1.csv:
Name Number
A 12
B 34
C 45
2.csv
Name Number
A 12
B 34
C 45
D 77
Z 67
The code below will print {'D', 'Z'} which are the new names.
def read_file_to_set(file_name):
    # skip the header line, then keep the first whitespace-separated field (the name) of each row
    with open(file_name) as f:
        return set(l.strip().split()[0] for x, l in enumerate(f.readlines()) if x > 0)
set_1 = read_file_to_set('1.csv')
set_2 = read_file_to_set('2.csv')
new_names = set_2 - set_1
print(new_names)
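If the new names should end up in a third file rather than just printed, a plain-text write is enough (the output file name here is just an example):
with open('test3.csv', 'w') as out:
    out.write('Name\n')              # header row
    for name in sorted(new_names):
        out.write(name + '\n')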
This answer assumes that the data is lined up as in your example:
import pandas as pd
# "read" each file
df1 = pd.DataFrame({'Name': ['gfd454', '3v4fd', 'th678iy']})
df2 = pd.DataFrame({'Name': ['gfd454', '3v4fd', 'th678iy', 'fdvs']})
# make column names unique
df1 = df1.rename(columns={'Name': 'Name1'})
df2 = df2.rename(columns={'Name': 'Name2'})
# line them up next to each other
df = pd.concat([df1, df2], axis=1)
# get difference
diff = df[df['Name1'].isnull()]['Name2'] # or df[df['Name1'] != df['Name2']]['Name2']
# write
diff.to_csv('test3.csv')
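One small usage note: to_csv writes the pandas row index as an extra column by default; pass index=False if you only want the names in test3.csv:
diff.to_csv('test3.csv', index=False)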
This should be straightforward - the solution assumes that the content of file2 is the same or longer, so items are only appended to file2.
import pandas as pd
df1 = pd.read_csv(r"C:\path\to\file1.csv")
df2 = pd.read_csv(r"C:\path\to\file2.csv")
# print(df1)
# print(df2)
df = pd.concat([df1, df2], axis=1)
df['X'] = df['A'] == df['B']
print(df[df.X==False])
df3 = df[df.X==False]['B']
print(df3)
df3.to_csv(r"C:\path\to\file3.csv")
If the items are in arbitrary order, you could use df.isin() as follows:
import pandas as pd
df1 = pd.read_csv(r"C:\path\to\file1.csv")
df2 = pd.read_csv(r"C:\path\to\file2.csv")
df = pd.concat([df1, df2], axis=1)
df['X'] = df['B'].isin(df['A'])
df3 = df[df.X==False]['B']
df3.to_csv(r"C:\path\to\file3.csv")
For testing, I have created the following two files. file1.csv:
A
1_in_A
2_in_A
3_in_A
4_in_A
and file2.csv:
B
2_in_A
1_in_A
3_in_A
4_in_B
5_in_B
The dataframe df looks as follows:
| | A | B | X |
|---:|:-------|:-------|:------|
| 0 | 1_in_A | 2_in_A | True |
| 1 | 2_in_A | 1_in_A | True |
| 2 | 3_in_A | 3_in_A | True |
| 3 | 4_in_A | 4_in_B | False |
| 4 | nan | 5_in_B | False |
and we select only the items that are flagged as False.
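If you would rather not rely on the rows lining up at all, another common sketch is an anti-join via merge with indicator=True (using the Name column from the original question):
import pandas as pd
df1 = pd.read_csv('test1.csv')
df2 = pd.read_csv('test2.csv')
# keep the rows of test2 whose Name never appears in test1
new_rows = (df2.merge(df1[['Name']], on='Name', how='left', indicator=True)
               .query("_merge == 'left_only'")
               .drop(columns='_merge'))
new_rows.to_csv('test3.csv', index=False)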
I have two dataframes with similar formats. Both have 3 indexes/headers. Most of the headers are the same, but df2 has a few additional ones. When I add them up, the order of the headers gets mixed up. I would like to maintain the order of df1. Any ideas?
Global = pd.read_excel('Mickey Mouse_Clean2.xlsx',header=[0,1,2,3],index_col=[0,1],sheet_name = 'Global')
Oslav = pd.read_excel('Mickey Mouse_Clean2.xlsx',header=[0,1,2,3],index_col=[0,1],sheet_name = 'Country XYZ')
Oslav = Oslav.replace(to_replace=1,value=10)
Oslav = Oslav.replace(to_replace=-1,value=-2)
df = Global.add(Oslav,fill_value=0)
Example of the df format:
HeaderA | Header2 | Header3 |
xxx1|xxx2|xxx3|xxx4||xxx1|xxx2|xxx3|xxx4||xxx1|xxx2|xxx3|xxx4 |
ColX|ColY |ColA|ColB|ColC|ColD||ColD|ColE|ColF|ColG||ColH|ColI|ColJ|ColDK|
1 | ds | 1 | |+1 |-1 | .......................................
2 | dh | ..........................................................
3 | ge | ..........................................................
4 | ew | ..........................................................
5 | er | ..........................................................
df = df[list(Global.columns) + list(set(Oslav.columns) - set(Global.columns))].copy()
or
df = df[list(Global.columns) + [col for col in Oslav.columns if col not in Global.columns]].copy()
(The second option should preserve the order of the Oslav columns as well, if you care about that.)
or
df = df.reindex(columns=list(Global.columns) + list(set(Oslav.columns) - set(Global.columns)))
If you don't want to keep the columns that are in Oslav, but not in Global, you can do
df = df[Global.columns].copy()
Note that without .copy(), you're getting a view of the previous dataframe, rather than a dataframe in its own right.
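A toy illustration of the ordering behaviour with flat columns (the frames and column names here are invented purely to show the effect):
import pandas as pd
Global = pd.DataFrame([[1, 2, 3]], columns=['C', 'A', 'B'])
Oslav = pd.DataFrame([[10, 20, 30, 40]], columns=['B', 'A', 'C', 'D'])
df = Global.add(Oslav, fill_value=0)     # add() returns the columns in sorted order: A, B, C, D
extra = [col for col in Oslav.columns if col not in Global.columns]
df = df.reindex(columns=list(Global.columns) + extra)
print(list(df.columns))                  # ['C', 'A', 'B', 'D'] -- Global's order, then the extras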