I am having some issues using pandas to manipulate some files into the right format.
I have multiple CSVs that I want to merge into one file.
They all have this sort of structure:
data1, data2 etc.
The key column is area (I'd ideally like to rename that column). Any area name made up of two words appears with "%20" instead of a space, and I'd want to remove that %20.
And lastly, I want each area to get its respective longitude and latitude from this file here.
If anyone has some pointers on achieving this, that would be amazing. I'm stuck on df.merge and keep receiving errors. So close to getting it complete!
You can use pd.concat (pandas documentation). As for string cleaning, you can apply a lambda function after the fact. Here's an example:
df1 = pd.read_csv('file1.csv')
df1
a b c
0 1 2 aardvark
1 3 4 pangolin
2 5 6 cat%20dog
df2 = pd.read_csv('file2.csv')
df2
a b c
0 7 8 bus
1 9 10 boat
2 11 12 ferry%20plane
# use pd.concat
df = pd.concat([df1, df2]).reset_index(drop=True)
df
a b c
0 1 2 aardvark
1 3 4 pangolin
2 5 6 cat%20dog
3 7 8 bus
4 9 10 boat
5 11 12 ferry%20plane
# apply lambda to clean the %20
f = lambda s: s.replace('%20', ' ')
df['clean_c'] = df['c'].apply(f)
df[['a', 'b', 'clean_c']]
a b clean_c
0 1 2 aardvark
1 3 4 pangolin
2 5 6 cat dog
3 7 8 bus
4 9 10 boat
5 11 12 ferry plane
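To cover the rest of the original question (renaming the area column, cleaning the "%20", and joining longitude/latitude from a lookup file), here is a minimal sketch. The column names area, area_name, lat and long and the file name areas_latlong.csv are assumptions for illustration, not taken from the actual files:
import pandas as pd

# combine all the CSVs first
df = pd.concat([pd.read_csv(f) for f in ['file1.csv', 'file2.csv']]).reset_index(drop=True)

# rename the key column and replace the URL-encoded spaces
df = df.rename(columns={'area': 'area_name'})
df['area_name'] = df['area_name'].str.replace('%20', ' ', regex=False)

# join the coordinates from the lookup file on the cleaned name
coords = pd.read_csv('areas_latlong.csv')   # assumed to hold area_name, lat, long
df = df.merge(coords, on='area_name', how='left')
Using how='left' keeps every row from the combined frame even when an area has no coordinates in the lookup file.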
Context: I'd like to add a new header row (a MultiIndex level) on top of the columns. For example, if I have this dataframe:
tt = pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]})
Desired Output: How could I make it so that I can add "Table X" on top of the columns A,B, and C?
Table X
A B C
0 1 4 7
1 2 5 8
2 3 6 9
Possible solutions(?): I was thinking about transposing the dataframe, adding the multi-index, and transposing it back again, but I'm not sure how to do that without writing the dataframe columns out manually (I've checked other SO posts about this as well).
Thank you!
In the meantime I've also discovered this solution:
tt = pd.concat([tt], keys=['Table X'], axis=1)
which also yields the desired output:
Table X
A B C
0 1 4 7
1 2 5 8
2 3 6 9
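As a side note (not from the original post), the same keys= trick generalizes: passing a mapping to pd.concat labels several frames at once, each under its own top-level header. A minimal sketch, with tt * 10 used purely as a second dummy frame:
import pandas as pd

tt = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# the dict keys become the outer column level
combined = pd.concat({'Table X': tt, 'Table Y': tt * 10}, axis=1)
print(combined)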
If you want a data frame like the one you wrote, you need a MultiIndex data frame. Try this:
import pandas as pd
# you need a nested dict first
dict_nested = {'Table X': {'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]}}
# then you have to reform it so each key is an (outer, inner) tuple
reformed_dict = {}
for outer_key, inner_dict in dict_nested.items():
    for inner_key, values in inner_dict.items():
        reformed_dict[(outer_key, inner_key)] = values
# last but not least, convert it to a MultiIndex dataframe
multiindex_df = pd.DataFrame(reformed_dict)
print(multiindex_df)
# >> Table X
# >> A B C
# >> 0 1 4 7
# >> 1 2 5 8
# >> 2 3 6 9
You can use pd.MultiIndex.from_tuples() to set/change the columns of the dataframe to a MultiIndex:
tt.columns = pd.MultiIndex.from_tuples((
('Table X', 'A'), ('Table X', 'B'), ('Table X', 'C')))
Result (tt):
Table X
A B C
0 1 4 7
1 2 5 8
2 3 6 9
Add-on: since those are MultiIndex levels, you can change them later. set_levels returns a new index, so assign it back (the inplace argument has been removed in recent pandas):
tt.columns = tt.columns.set_levels(['table_x'], level=0)
tt.columns = tt.columns.set_levels(['a', 'b', 'c'], level=1)
table_x
a b c
0 1 4 7
1 2 5 8
2 3 6 9
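A minimal alternative sketch (not from the original answer): DataFrame.rename also accepts a level argument, so the same renaming can be done without touching the index object directly:
tt = tt.rename(columns={'Table X': 'table_x'}, level=0)
tt = tt.rename(columns={'A': 'a', 'B': 'b', 'C': 'c'}, level=1)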
My current data frame consists of 10 rows and thousands of columns. The setup currently looks similar to this:
A B A B
1 2 3 4
5 6 7 8
But I desire something more like below, where essentially I would transpose the columns into rows once the headers start repeating themselves.
A B
1 2
5 6
3 4
7 8
I've been trying df.reshape but perhaps can't get the syntax right. Any suggestions on how best to transpose the data like this?
I'd probably go for stacking, grouping and then building a new DataFrame from scratch, eg:
pd.DataFrame({col: vals for col, vals in df.stack().groupby(level=1).agg(list).items()})
That'll also give you:
A B
0 1 2
1 3 4
2 5 6
3 7 8
Try with stack, groupby and pivot (pivot's arguments are passed by keyword here, since recent pandas versions no longer accept them positionally):
stacked = df.T.stack().to_frame()
stacked = stacked.assign(idx=stacked.groupby(level=0).cumcount()).reset_index()
output = stacked.pivot(index="idx", columns="level_0", values=0).rename_axis(None, axis=1).rename_axis(None, axis=0)
>>> output
A B
0 1 2
1 5 6
2 3 4
3 7 8
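A different way to get that row order, sketched here as an alternative and resting on an assumption the question doesn't state explicitly (that the headers repeat in fixed blocks of two columns, A and B): slice the frame into those blocks and concatenate them vertically.
block_width = 2  # assumed width of one repeating A/B block
blocks = [df.iloc[:, i:i + block_width] for i in range(0, df.shape[1], block_width)]
out = pd.concat(blocks, ignore_index=True)
With the sample frame above this gives 1 2, 5 6, 3 4, 7 8, i.e. the exact order shown in the question.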
I have two datasets, qlik_clean and synapse_clean. qlik_clean contains one column of type string (col = field) and synapse_clean contains two columns (cols = field, model), both of which are of type string.
My goal is to compare and find out which rows of col field in qlik_clean are also in col field in synapse_clean, and return either True or the value of col model for that specific row.
I have tried using pandas and lambda functions, but with no success. I have also tried converting the df to a dict and going from there, but with no success either.
Could someone help out and point me in the right direction? I have my code below, I haven't added any comparison logic yet, would anyone know the way to go here?
Kind regards,
Rutger
import pandas as pd
import openpyxl
import re
#open synapse fields file
synapsefilepath = ''
synapse = pd.read_excel(synapsefilepath, engine='openpyxl')
#create df from synapse file
synapse_df = pd.DataFrame(synapse)
#transform fields into list
synapse_dict = synapse_df.to_dict('list')
#all keys and values to lower
synapse_clean = {k.lower(): str(v).lower() for k, v in synapse_dict.items()}
##### QLIK FILE #######
#open qlik file
qlikfilepath= ''
qlik = pd.read_excel(qlikfilepath, engine='openpyxl', sheet_name='qliksynapsecomp')
#create df from excel file and keep only qlikfieldname and insynapse
qlik_df = pd.DataFrame(qlik)
qlik_df_trunc = qlik_df[['field']]
#transform fields into list
qlik_dict = qlik_df_trunc.to_dict('list')
#lowercase all keys and values
qlik_clean = {ke.lower().strip(): str(va).lower().strip() for ke, va in qlik_dict.items()}
I can't get a good idea of your dataframes' structures from the posted code, but I think this should answer your question.
To filter df1 by values in df1.col1 that are in df2.col1:
df1[df1.col1.isin(df2.col1)]
Ex:
>>> df1
A B C D
0 6 5 9 6
1 4 1 6 3
2 3 4 6 9
3 0 6 0 5
4 1 6 3 3
>>> df2
A B C D
0 6 4 5 4
1 4 3 6 9
2 2 2 8 5
3 6 6 7 3
4 7 0 1 7
>>> df1[df1.A.isin(df2.A)]
A B C D
0 6 5 9 6
1 4 1 6 3
You had mentioned something about a SQL LIKE statement, but also mentioned that you are not working with strings. If I misunderstood and you need to do partial string matching, let me know.
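Since the original goal was to get back either True or the value of model for matching rows, a merge may fit better than isin alone. A minimal sketch, assuming both frames keep a plain field column; the names field, model, qlik_df and synapse_df come from the posted code and question, while the normalisation step mirrors what the poster was doing with dicts:
# normalise the join key on both sides (lower-case, stripped)
qlik_df['field'] = qlik_df['field'].str.lower().str.strip()
synapse_df['field'] = synapse_df['field'].str.lower().str.strip()

# left merge keeps every qlik row; 'model' is NaN where there is no match
result = qlik_df.merge(synapse_df[['field', 'model']], on='field', how='left')
result['in_synapse'] = result['model'].notna()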
Take two data frames
print(df1)
A B
0 a 1
1 a 3
2 a 5
3 b 7
4 b 9
5 c 11
6 c 13
7 c 15
print(df2)
C D
a apple 1
b pear 1
c apple 1
So the values in column df1['A'] are the indexes of df2.
I want to select the rows in df1 where the values in column A are 'apple' in df2['C']. Resulting in:
A B
0 a 1
1 a 3
2 a 5
5 c 11
6 c 13
7 c 15
Made many edits due to comments and question edits.
Basically, you first extract the indexes of df2 by filtering that dataframe on the values in C, then filter df1 by those indexes with isin:
indexes = df2[df2['C']=='apple'].index
df1[df1['A'].isin(indexes)]
>>>
A B
0 a 1
1 a 3
2 a 5
5 c 11
6 c 13
7 c 15
UPDATE
If you want to minimize memory allocation, try to avoid saving intermediate results (note: I am not sure this will solve your memory allocation issue, because I don't have the full details of the situation and may not even be best placed to provide a solution):
df1[df1['A'].isin( df2[df2['C']=='apple'].index)]
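A slightly leaner variant, sketched here as an assumption rather than taken from the answer: index the Index directly with a boolean mask, which skips building the intermediate filtered DataFrame (same df1/df2 as above):
mask = df2['C'].eq('apple')
df1[df1['A'].isin(df2.index[mask])]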
value Group something
0 a 1 1
1 b 1 2
2 c 1 4
3 c 2 9
4 b 2 10
5 x 2 5
6 d 2 3
7 e 3 5
8 d 2 10
9 a 3 5
I want to select the last 3 rows of each group (from the above df), like the following, but perform the operation in place. I want to ensure that I am keeping only the new df object in memory after assignment. What would be an efficient way of doing it?
df = df.groupby('Group').tail(3)
The result should look like the following:
value Group something
0 a 1 1
1 b 1 2
2 c 1 4
5 x 2 5
6 d 2 3
7 e 3 5
8 d 2 10
9 a 3 5
N.B.: This question is related to Keeping the last N duplicates in pandas.
df = df.groupby('Group').tail(3) is already an efficient way of doing it. Because you are rebinding the df variable, Python's garbage collector will release the memory of the old dataframe once nothing else references it, and you will only have access to the new one.
Trying way too hard to guess what you want.
NOTE: using Pandas inplace argument where it is available is NO guarantee that a new DataFrame won't be created in memory. In fact, it may very well create a new DataFrame in memory and replace the old one behind the scenes.
from collections import defaultdict

def f(s):
    # walk the series backwards, counting occurrences per group value,
    # and yield the index labels beyond the last 3 of each group
    c = defaultdict(int)
    for i, x in zip(s.index[::-1], s.values[::-1]):
        c[x] += 1
        if c[x] > 3:
            yield i

df.drop([*f(df.Group)], inplace=True)
df
value Group something
0 a 1 1
1 b 1 2
2 c 1 4
5 x 2 5
6 d 2 3
7 e 3 5
8 d 2 10
9 a 3 5
Your answer is already in the post. However, as said earlier in the comments, you are overwriting the existing df; to avoid that, assign the result to a new name like below:
new_df = df.groupby('Group').tail(3)
However, out of curiosity: if you are not concerned about the groupby and are only looking for the last N lines of the df, you can do it like below:
df[-2:]  # last 2 rows