Split Dataframe based on rows PYTHON [closed] - python

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 days ago.
I have a dataframe in which the data are stacked above each other like this:
I would like to make some changes in the dataframe, so it becomes like this:

This is a tricky one. One way to solve it is to iterate over every cell, like this:
import pandas as pd
df = pd.read_csv('df.csv', header=None)
for i in range(len(df)):
    for j in df.columns:
        print(df.at[i,j])
the output is :
time
Q1
1
12
2
23
3
45
7
88
9
11
10
12
.
.
.
.
.
.
time
Q2
1
11
2
9
4
9
8
8
12
7
13
2
.
.
.
.
.
.
time
Q2
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
You can then build up the column names and the rows step by step, like this:
# columns = []
# data = []
# columns = ['time']
# data = []
# columns = ['time','Q1']
# data = []
# columns = ['time','Q1']
# data = [[1]]
# columns = ['time','Q1']
# data = [[1,12]]
# columns = ['time','Q1']
# data = [[1,12],[2]]
# columns = ['time','Q1']
# data = [[1,12],[2,23]]
...
# columns = ['time','Q1']
# data = [[1,12],[2,23],[3,45],[7,88],[9,11],[10,12],[.,.],[.,.],[.,.]]
# columns = ['time','Q1','time']
# data = [[1,12],[2,23],[3,45],[7,88],[9,11],[10,12],[.,.],[.,.],[.,.]]
# columns = ['time','Q1','time','Q2']
# data = [[1,12],[2,23],[3,45],[7,88],[9,11],[10,12],[.,.],[.,.],[.,.]]
# columns = ['time','Q1','time','Q2']
# data = [[1,12,1],[2,23],[3,45],[7,88],[9,11],[10,12],[.,.],[.,.],[.,.]]
# columns = ['time','Q1','time','Q2']
# data = [[1,12,1,11],[2,23],[3,45],[7,88],[9,11],[10,12],[.,.],[.,.],[.,.]]
# columns = ['time','Q1','time','Q2']
# data = [[1,12,1,11],[2,23,2],[3,45],[7,88],[9,11],[10,12],[.,.],[.,.],[.,.]]
# columns = ['time','Q1','time','Q2']
# data = [[1,12,1,11],[2,23,2,9],[3,45],[7,88],[9,11],[10,12],[.,.],[.,.],[.,.]]
...
# columns = ['time', 'Q1', 'time', 'Q2', 'time', 'Q2', ...]
# data = [['1', '12', '1', '11', '.', '.'], ['2', '23', '2', '9', '.', '.'], ['3', '45', '4', '9', '.', '.'], ['7', '88', '8', '8', '.', '.'], ['9', '11', '12', '7', '.', '.'], ['10', '12', '13', '2', '.', '.'], ['.', '.', '.', '.', '.', '.'], ['.', '.', '.', '.', '.', '.'], ['.', '.', '.', '.', '.', '.'], ...]
That logic is implemented by the following code:
import pandas as pd
df = pd.read_csv('df.csv', header=None)
columns = []
data = []
createRowMode = True
createRowCounter = 0
for i in range(len(df)):
    for j in df.columns:
        if df.at[i,j] == 'time' or df.at[i,j][0] == 'Q':
            if createRowCounter > 1:
                createRowMode = False
            createRowCounter += 1
            columns.append(df.at[i,j])
            counter = 0
        elif j == 0:
            if createRowMode:
                data.append([])
            data[counter].append(df.at[i,j])
        else:
            data[counter].append(df.at[i,j])
            counter += 1
new_df = pd.DataFrame(data, columns=columns)
print(new_df)
The output is :
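The same reshaping can also be sketched without the cell-by-cell loop. This is only an illustration under some assumptions: every block starts with a row whose first cell is 'time', all blocks should be placed side by side, and the small `raw` frame below stands in for `pd.read_csv('df.csv', header=None)`. The names `raw`, `is_header`, `block_id` and `blocks` are made up for this example.

```python
import pandas as pd

# Stand-in for pd.read_csv('df.csv', header=None); values are strings
raw = pd.DataFrame([
    ['time', 'Q1'], ['1', '12'], ['2', '23'], ['3', '45'],
    ['time', 'Q2'], ['1', '11'], ['2', '9'], ['4', '9'],
])

# A new block starts wherever the first column holds the literal 'time'
is_header = raw[0].eq('time')
block_id = is_header.cumsum()

blocks = []
for _, block in raw.groupby(block_id):
    header = block.iloc[0].tolist()              # e.g. ['time', 'Q1']
    body = block.iloc[1:].reset_index(drop=True)  # data rows, renumbered from 0
    body.columns = header
    blocks.append(body)

# Put the blocks next to each other; duplicate 'time' columns are kept
new_df = pd.concat(blocks, axis=1)
print(new_df)
```

Blocks of unequal length would be padded with NaN by `pd.concat`, which may or may not be what you want.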

Related

column comparison of two dataframe, return df with mismatches python

I want to print, for each of two dataframes, the rows where a given column (here "second_column") does not match between them.
"first_column" is a key that identifies the same product in both dataframes.
import pandas as pd
data1 = {
'first_column': ['id1', 'id2', 'id3'],
'second_column': ['1', '2', '2'],
'third_column': ['1', '2', '2'],
'fourth_column': ['1', '2', '2']
}
df1 = pd.DataFrame(data1)
print(df1)
test = df1['second_column'].nunique()
data2 = {
'first_column': ['id1', 'id2', 'id3'],
'second_column': ['3', '4', '2'],
'third_column': ['1', '2', '2'],
'fourth_column': ['1', '2', '2']
}
df2 = pd.DataFrame(data2)
print(df2)
expected output:
If I understand correctly:
By the way, your screenshots don't match your DataFrame definitions.
df1.loc[~df1['second_column'].isin(df2['second_column'])]
first_column second_column third_column fourth_column
0 1 1 1 1
df2.loc[~df2['second_column'].isin(df1['second_column'])]
first_column second_column third_column fourth_column
0 1 3 1 1
1 2 4 2 2
The compare method can do what you want.
different_rows = df1.compare(df2, align_axis=1).index
df1.loc[different_rows]
One important caveat: compare expects both DataFrames to have identical row and column labels. If one of them has extra rows (index values), it raises a ValueError rather than reporting them as differences.
Or, if you want to find differences in one column only, you can first join on the index and then check where the joined values differ:
joined_df = df1.join(df2['second_column'], rsuffix='_df2')
diff = joined_df['second_column']!=joined_df['second_column_df2']
print(joined_df.loc[diff, df1.columns])
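Since the question says "first_column" is the key, the comparison can also be aligned on that key instead of on the index. The following is a minimal sketch, assuming the key is unique within each dataframe; the names `merged`, `mismatch_keys`, `df1_mismatch` and `df2_mismatch` are made up for this example.

```python
import pandas as pd

df1 = pd.DataFrame({
    'first_column': ['id1', 'id2', 'id3'],
    'second_column': ['1', '2', '2'],
})
df2 = pd.DataFrame({
    'first_column': ['id1', 'id2', 'id3'],
    'second_column': ['3', '4', '2'],
})

# Align the two frames on the key, keeping both versions of second_column
merged = df1.merge(df2, on='first_column', suffixes=('_df1', '_df2'))

# Keys whose second_column differs between the two frames
differs = merged['second_column_df1'] != merged['second_column_df2']
mismatch_keys = merged.loc[differs, 'first_column']

# Rows of each original frame that belong to a mismatching key
df1_mismatch = df1[df1['first_column'].isin(mismatch_keys)]
df2_mismatch = df2[df2['first_column'].isin(mismatch_keys)]
print(df1_mismatch)
print(df2_mismatch)
```

Unlike the isin approach above, this compares values per key, so a value that happens to appear under a different key in the other frame is still reported as a mismatch.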

join column names in a new pandas columns conditional on value

I have the following dataset:
data = {'Environment': ['0', '0', '0'],
'Health': ['1', '0', '1'],
'Labor': ['1', '1', '1'],
}
df = pd.DataFrame(data, columns=['Environment', 'Health', 'Labor'])
I want to create a new column df['Keyword'] whose value is a join of the column names with value > 0.
Expected Outcome:
data = {'Environment': ['0', '0', '0'],
'Health': ['1', '0', '1'],
'Labor': ['1', '1', '1'],
'Keyword': ['Health, Labor', 'Labor', 'Health, Labor']}
df_test = pd.DataFrame(data, columns=['Environment', 'Health', 'Labor', 'Keyword'])
df_test
df = pd.DataFrame(data, columns=['Environment', 'Health', 'Labor'])
How do I go about it?
Other version with .apply():
df['Keyword'] = df.apply(lambda x: ', '.join(b for a, b in zip(x, x.index) if a=='1'),axis=1)
print(df)
Prints:
Environment Health Labor Keyword
0 0 1 1 Health, Labor
1 0 0 1 Labor
2 0 1 1 Health, Labor
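Another method uses mask and stack, then groupby to aggregate the column names.
stack drops NaN values by default. Note that df.lt(1) needs numeric values, so with the string data from the question, convert the columns to numbers first.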
df['keyword'] = (
    df.mask(df.lt(1))
      .stack()
      .reset_index(1)
      .groupby(level=0)["level_1"]
      .agg(list)
)
print(df)
Environment Health Labor keyword
0 0 1 1 [Health, Labor]
1 0 0 1 [Labor]
2 0 1 1 [Health, Labor]
The first problem is that the values in the sample data are strings, so to compare them numerically, convert them first:
df = df.astype(float).astype(int)
Or:
df = df.replace({'0':0, '1':1})
Then use DataFrame.dot to multiply the boolean mask by the column names with a separator appended, and finally strip the trailing separator from the right side:
df['Keyword'] = df.gt(0).dot(df.columns + ', ').str.rstrip(', ')
print (df)
Environment Health Labor Keyword
0 0 1 1 Health, Labor
1 0 0 1 Labor
2 0 1 1 Health, Labor
Or compare the strings directly, e.g. not equal to '0' or equal to '1':
df['Keyword'] = df.ne('0').dot(df.columns + ', ').str.rstrip(', ')
df['Keyword'] = df.eq('1').dot(df.columns + ', ').str.rstrip(', ')
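For comparison, here is a self-contained sketch that builds the same column from a boolean mask with a plain list comprehension instead of the dot product. It assumes every value is a string holding an integer; `flags` and `names` are names made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Environment': ['0', '0', '0'],
    'Health': ['1', '0', '1'],
    'Labor': ['1', '1', '1'],
})

# Convert the string values to integers, then mark the cells that are > 0
flags = df.astype(int).gt(0)
names = flags.columns

# For each row, keep the column names where the flag is True and join them
df['Keyword'] = [', '.join(names[row]) for row in flags.to_numpy()]
print(df)
```

The mask is computed before the new column is added, so 'Keyword' itself never takes part in the comparison.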

Create a dictionary from lists, overwrite duplicate keys

I have my code below. I am trying to create a dictionary from my lists extracted from a txt file but the loop overwrites the previous information:
f = open('data.txt','r')
lines = f.readlines()
lines = [line.rstrip('\n') for line in open('data.txt')]
columns=lines.pop(0)
for i in range(len(lines)):
    lines[i]=lines[i].split(',')
dictt={}
for line in lines:
    dictt[line[0]]=line[1:]
print('\n')
print(lines)
print('\n')
print(dictt)
I know I have to change this part:
for line in lines:
    dictt[line[0]] = line[1:]
but what can I do? Do I have to use numpy? If so, how?
My lines list is :
[['USS-Enterprise', '6', '6', '6', '6', '6'],
['USS-Voyager', '2', '3', '0', '4', '1'],
['USS-Peres', '10', '4', '0', '0', '5'],
['USS-Pathfinder', '2', '0', '0', '1', '2'],
['USS-Enterprise', '2', '2', '2', '2', '2'],
['USS-Voyager', '2', '1', '0', '1', '1'],
['USS-Peres', '8', '5', '0', '0', '4'],
['USS-Pathfinder', '4', '0', '0', '2', '1']]
My dict becomes:
{'USS-Enterprise': ['2', '2', '2', '2', '2'],
'USS-Voyager': ['2', '1', '0', '1', '1'],
'USS-Peres': ['8', '5', '0', '0', '4'],
'USS-Pathfinder': ['4', '0', '0', '2', '1']}
It keeps only the last row for each key, but I want to add the values together. I am really confused.
You are trying to append multiple values for the same key. You can use defaultdict for that, or modify your code and use the get method of dictionaries. Note that list.extend returns None, so concatenate with + instead of assigning the result of extend:
for line in lines:
    dictt[line[0]] = dictt.get(line[0], []) + line[1:]
This looks up each key, assigns line[1:] if the key is new, and if it is a duplicate, appends those values onto the previous values.
dict_output = {}
for line in list_input:
    if line[0] not in dict_output:
        dict_output[line[0]] = line[1:]
    else:
        dict_output[line[0]] += line[1:]
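The defaultdict route mentioned above can be sketched as follows, together with one possible reading of "add the values together", namely an element-wise sum per ship. That reading is an assumption, as are the names `appended` and `totals`; the `lines` list is a shortened copy of the one in the question.

```python
from collections import defaultdict

lines = [
    ['USS-Enterprise', '6', '6', '6', '6', '6'],
    ['USS-Voyager', '2', '3', '0', '4', '1'],
    ['USS-Enterprise', '2', '2', '2', '2', '2'],
    ['USS-Voyager', '2', '1', '0', '1', '1'],
]

# Variant 1: collect every value seen for a key in one growing list
appended = defaultdict(list)
for name, *values in lines:
    appended[name].extend(values)

# Variant 2 (assumed interpretation): sum the rows position by position
totals = {}
for name, *values in lines:
    numbers = [int(v) for v in values]
    if name in totals:
        totals[name] = [a + b for a, b in zip(totals[name], numbers)]
    else:
        totals[name] = numbers

print(dict(appended))
print(totals)
```

Neither variant needs numpy; plain lists and dictionaries are enough for data of this size.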
EDIT: You subsequently clarified in comments that your input has duplicate keys, and you want later rows to overwrite earlier ones.
ORIGINAL ANSWER: The input is not a dictionary, it's a CSV file. Just use pandas.read_csv() to read it:
import pandas as pd
df = pd.read_csv('my.csv', sep=r'\s+', header=None)
df
0 1 2 3 4 5
0 USS-Enterprise 6 6 6 6 6
1 USS-Voyager 2 3 0 4 1
2 USS-Peres 10 4 0 0 5
3 USS-Pathfinder 2 0 0 1 2
4 USS-Enterprise 2 2 2 2 2
5 USS-Voyager 2 1 0 1 1
6 USS-Peres 8 5 0 0 4
7 USS-Pathfinder 4 0 0 2 1
Seems your input didn't have a header row. If your input columns had names, you can add them with df.columns = ['Ship', 'A', 'B', 'C', 'D', 'E'] or whatever.
If you really want to write a dict output (beware of duplicate keys being suppressed), see df.to_dict()

Group by with sum conditions [duplicate]

This question already has answers here:
Python Pandas Conditional Sum with Groupby
(3 answers)
Closed 4 years ago.
I have the following df and I'd like to group it by Date & Ref but with sum conditions.
Specifically, I need to group by Date & Ref and sum the 'Q' column only for rows where P is greater than or equal to PP.
df = DataFrame({'Date' : ['1', '1', '1', '1'],
                'Ref' : ['one', 'one', 'two', 'two'],
                'P' : ['50', '65', '30', '38'],
                'PP' : ['63', '63', '32', '32'],
                'Q' : ['10', '15', '20', '10']})
df.groupby(['Date','Ref'])['Q'].sum() # This does the right grouping but sums the whole column
df.loc[df['P'] >= df['PP'], ('Q')].sum() # This has the right sum condition, but does not split by Date & Ref
Is there a way to do that?
Many thanks in advance
Just filter prior to grouping:
In[15]:
df[df['P'] >= df['PP']].groupby(['Date','Ref'])['Q'].sum()
Out[15]:
Date Ref
1 one 15
two 10
Name: Q, dtype: object
Filtering first also reduces the size of the df, which speeds up the groupby operation.
You could do:
import pandas as pd
df = pd.DataFrame({'Date' : ['1', '1', '1', '1'],
'Ref' : ['one', 'one', 'two', 'two'],
'P' : ['50', '65', '30', '38'],
'PP' : ['63', '63', '32', '32'],
'Q' : ['10', '15', '20', '10']})
def conditional_sum(x):
    return x[x['P'] >= x['PP']].Q.sum()
result = df.groupby(['Date','Ref']).apply(conditional_sum)
print(result)
Output
Date Ref
1 one 15
two 10
dtype: object
UPDATE
If you want to sum multiple columns in the output, you could use loc:
def conditional_sum(x):
    return x.loc[x['P'] >= x['PP'], ['Q', 'P']].sum()
result = df.groupby(['Date', 'Ref']).apply(conditional_sum)
print(result)
Output
Q P
Date Ref
1 one 15.0 65.0
two 10.0 38.0
Note that in the example above I used column P for the sake of showing how to do it with multiple columns.
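One thing to watch in the sample data: P, PP and Q are strings, so `>=` compares them as text and `sum` on a string column does not add numbers (hence the `dtype: object` in the output above). A hedged sketch of the filter-then-group approach with the columns converted to numbers first, assuming every value parses as a number; `numeric_cols` is a name made up for this example:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['1', '1', '1', '1'],
                   'Ref': ['one', 'one', 'two', 'two'],
                   'P': ['50', '65', '30', '38'],
                   'PP': ['63', '63', '32', '32'],
                   'Q': ['10', '15', '20', '10']})

# Convert the columns used for comparing and summing to numbers
numeric_cols = ['P', 'PP', 'Q']
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)

# Keep only rows where P >= PP, then sum Q per (Date, Ref) group
result = df[df['P'] >= df['PP']].groupby(['Date', 'Ref'])['Q'].sum()
print(result)
```

With numeric columns, a value such as '100' compares as larger than '63', which would not be the case for a string comparison.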

Python if/else statement confusion

How can you write an if/else statement in Python when you have a file with both text and numbers? Let's say I want to replace the values in the third-to-last column of the file below. I want an if/else statement that replaces values < 5, or a dot ".", with a zero, and if possible uses that value as an integer for a sum.
A quick and dirty solution using awk would look like this, but I'm curious how to handle this type of data with Python:
awk -F"[ :]" '{if ( (!/^#/) && ($9<5 || $9==".") ) $9="0" ; print }'
So how do you solve this problem?
Thanks
Input file:
##Comment1
#Header
sample1 1 2 3 4 1:0:2:1:.:3
sample2 1 4 3 5 1:3:2:.:3:3
sample3 2 4 6 7 .:0:6:5:4:0
Desired output:
##Comment1
#Header
sample1 1 2 3 4 1:0:2:0:0:3
sample2 1 4 3 5 1:3:2:0:3:3
sample3 2 4 6 7 .:0:6:5:4:0
SUM = 5
Result so far
['sample1', '1', '2', '3', '4', '1', '0', '2', '0', '0', '3\n']
['sample2', '1', '4', '3', '5', '1', '3', '2', '0', '3', '3\n']
['sample3', '2', '4', '6', '7', '.', '0', '6', '5', '4', '0']
Here's what I have tried so far:
import re
data=open("inputfile.txt", 'r')
for line in data:
    if not line.startswith("#"):
        nodots = line.replace(":.",":0")
        final_nodots=re.split('\t|:',nodots)
        if (int(final_nodots[8]))<5:
            final_nodots[8]="0"
            print (final_nodots)
        else:
            print(final_nodots)
import re
data=open("inputfile.txt", 'r')
sums = 0
for line in data:
    if not line.startswith("#"):
        nodots = line.replace(".","0")
        final_nodots=list(re.findall(r'\d:.+\d+',nodots)[0])
        if (int(final_nodots[6]))<5:
            final_nodots[6]="0"
        print(final_nodots)
        sums += int(final_nodots[6])
print(sums)
You were pretty close, but your final_nodots ends up split only on ':' and not on the first few numbers, so your 8 should have been a 3. After that, just add a sums counter to keep track of that slot.
['sample1 1 2 3 4 1', '0', '2', '0', '0', '3\n']
There are better ways to achieve what you want but I just wanted to fix your code.
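One such alternative is sketched below. It mirrors the awk one-liner rather than the desired output: only the fourth colon-separated value (awk's `$9`) is changed. It assumes the columns are separated by whitespace and that the colon-separated group is the last column; `SAMPLE`, `clean_line`, `index` and `threshold` are names made up for this example, and the sample text stands in for the input file.

```python
SAMPLE = """##Comment1
#Header
sample1 1 2 3 4 1:0:2:1:.:3
sample2 1 4 3 5 1:3:2:.:3:3
sample3 2 4 6 7 .:0:6:5:4:0
"""

def clean_line(line, index=3, threshold=5):
    """Zero out one colon-separated value if it is '.' or below threshold."""
    prefix, last = line.rsplit(None, 1)   # split off the last column
    parts = last.split(':')
    value = parts[index]
    if value == '.' or int(value) < threshold:
        parts[index] = '0'
    return f"{prefix} {':'.join(parts)}", int(parts[index])

total = 0
cleaned = []
for line in SAMPLE.splitlines():
    if line.startswith('#'):
        cleaned.append(line)              # comment and header lines pass through
        continue
    new_line, value = clean_line(line)
    cleaned.append(new_line)
    total += value

print('\n'.join(cleaned))
print('SUM =', total)
```

Splitting the last column separately avoids counting positions in a list of single characters, so multi-digit values are handled as well.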
