Python: Pivot up values in a data frame using lambda

I am reading a text file with pandas and storing the details in a DataFrame. Below is the input file from which the DataFrame is created:
SourceID|OrganizationName|AddressLine1|AddressLine2|City
1|Manor Drug Medical And Pharma|5795 N 1st St||Uta
1|Manor Drug Medical And Pharma|23230 Red River|Dr Ste 104|Evanston
On this DataFrame I am trying to pivot up the address information, grouping by SourceID. Below is the expected output:
SourceID|OrganizationName|AddressLine1|AddressLine2|City
1|Manor Drug Medical And Pharma|5795 N 1st St^23230 Red River|^Dr Ste 104|Evanston^Uta
Below is the code I used for this:
import pandas as pd
df = pd.read_csv('PivotingValues.txt', sep="|")
a=df[df.groupby("SourceID")['AddressLine1'].apply(lambda tags: '^'.join(tags)), df.groupby("SourceID")['City'].apply(lambda tags: '^'.join(tags))]
print(a)
Can you please help in achieving this? Is there any method other than a lambda that can be used to achieve the same result?

Use -
df.groupby(['SourceID', 'OrganizationName'], as_index=False).agg('^'.join)
OR
df.groupby(['SourceID', 'OrganizationName'], as_index=False).agg({'AddressLine1': '^'.join, 'AddressLine2': '^'.join, 'City': '^'.join})
Output
SourceID OrganizationName AddressLine1 AddressLine2 City
1 Manor Drug Medical And Pharma 5795 N 1st St^23230 Red River ^Dr Ste 104 Uta^Evanston
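For reference, a minimal end-to-end sketch (assuming the pipe-delimited PivotingValues.txt from the question; the fillna('') is needed because the empty AddressLine2 cells are read as NaN, which '^'.join cannot handle):
import pandas as pd

df = pd.read_csv('PivotingValues.txt', sep='|')
df = df.fillna('')  # empty cells arrive as NaN; '^'.join needs strings
out = df.groupby(['SourceID', 'OrganizationName'], as_index=False).agg('^'.join)
print(out)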

Related

Scraping tables using Pandas read_html and identifying headers

I am completely new to web scraping and would like to parse a specific table that occurs in the SEC filing DEF 14A of companies. I was able to get the right URL and pass it to pandas.
Note: even though the desired table should occur in every DEF 14A, its layout may differ from company to company. Right now I am struggling with formatting the dataframe.
How do I manage to get the right header and join it into a single index (column)?
This is my code so far:
import requests
import bs4 as bs
import pandas as pd

url_to_use = "https://www.sec.gov/Archives/edgar/data/1000229/000095012907000818/h43371ddef14a.htm"
resp = requests.get(url_to_use)
soup = bs.BeautifulSoup(resp.text, "html.parser")
dfs = pd.read_html(resp.text, match="Salary")
pd.options.display.max_columns = None
df = dfs[0]
df.dropna(how="all", inplace=True)
df.dropna(axis=1, how="all", inplace=True)
display(df)
Right now the output of my code looks like this:
[screenshot: dataframe output]
Whereas the correct layout looks like this:
[screenshot: original table format]
Is there some way to identify those rows that belong to the header and combine them as the header?
The table HTML is rather messed up: the empty cells are actually present in the source code. It would be easiest to do some post-processing:
import pandas as pd
import requests

# load with a user agent header to avoid a 401 error
r = requests.get("https://www.sec.gov/Archives/edgar/data/1000229/000095012907000818/h43371ddef14a.htm", headers={'User-agent': 'Mozilla/5.0'}).text
df = pd.read_html(r)
df = df[40]  # get the right table from the list of dataframes
df = df[8:].rename(columns={i: ' '.join(df[i][:8].dropna()) for i in df.columns})  # generate column headers from the first 8 rows
df.dropna(how='all', axis=1, inplace=True)  # remove empty columns and rows
df.dropna(how='all', axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)

def sjoin(x):
    return ''.join(x[x.notnull()].astype(str))

# concatenate columns with the same headers, taken from https://stackoverflow.com/a/24391268/11380795
df = df.groupby(level=0, axis=1).apply(lambda x: x.apply(sjoin, axis=1))
Result:
 | All Other Compensation ($)(4) | Change in Pension Value and Nonqualified Deferred Compensation Earnings ($) | Name and Principal Position | Non-Equity Incentive Plan Compensation ($) | Salary ($) | Stock Awards ($)(1) | Total ($) | Year
0 | 8953 | (3) | David M. Demshur President and Chief Executive Officer | 766200(2) | 504569 | 1088559 | 2368281 | 2006
1 | 8944 | (3) | Richard L. Bergmark Executive Vice President, Chief Financial Officer and Treasurer | 330800(2) | 324569 | 799096 | 1463409 | 2006
2 | 8940 | (3) | Monty L. Davis Chief Operating Officer and Senior Vice President | 320800(2) | 314569 | 559097 | 1203406 | 2006
3 | 8933 | (3) | John D. Denson Vice President, General Counsel and Secretary | 176250(2) | 264569 | 363581 | 813333 | 2006
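Note: groupby(..., axis=1) is deprecated in recent pandas. A sketch of an equivalent without it, written against the same df and sjoin (untested on this exact table): transpose so the duplicate column labels become index labels, group on them, and rebuild the frame.
# concatenate same-named columns without groupby(axis=1)
df = pd.DataFrame({name: sub.apply(sjoin, axis=0)
                   for name, sub in df.T.groupby(level=0)})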

Parse a *.txt file, looping with a dictionary comprehension

I have a *.txt file coming from a SQL query organised in rows.
I'm reading it with the pandas library through:
df = pd.read_csv('./my_file_path/my_file.txt', sep='\n', header=0)
df.rename(columns={list(df.columns)[0]: 'cols'}, inplace=True)
The output is rows with the information separated by spaces in a standard structure (dots are meant to be spaces):
name................address........country..........age
0 Bob.Hope............Broadway.......United.States....101
1 Richard.Donner......Park.Avenue....United.States.....76
2 Oscar.Meyer.........Friedrichshain.Germany...........47
I tried to create a dictionary to get the info with list comprehensions:
col_dict = {'name': [df.cols[i][0:20].strip() for i in range(0, len(df.cols))],
            'address': [df.cols[i][21:36].strip() for i in range(0, len(df.cols))],
            'country': [df.cols[i][36:52].strip() for i in range(0, len(df.cols))],
            'age': [df.cols[i][53:].strip() for i in range(0, len(df.cols))],
            }
This script runs well and creates a dictionary that serves as the basis for a dataframe to work with. But I was asking myself whether there is any way to make the script more pythonic, looping directly through a dictionary of the column names and avoiding the repetition of the same code for every column (the actual dataset is much longer).
The question is how I can store the string indices so I can use them later with the column names to parse everything at once.
You can read it directly with pandas:
df = pd.read_csv('./my_file_path/my_file.txt', delim_whitespace=True)
If you know that the space between the columns is going to be at least 2 spaces, you can do it this way:
df = pd.read_csv('./my_file_path/my_file.txt', sep=r'\s{2,}', engine='python')  # regex separators need the python engine
In your case, the file is fixed width so you need to use a different method:
df = pd.read_fwf(StringIO(my_text), widths=[20, 15, 16, 10], skiprows=1)
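A self-contained sketch of that call (the sample text and widths are assumptions based on the dotted layout above; StringIO stands in for the real file):
from io import StringIO

import pandas as pd

my_text = (
    "name                address        country         age\n"
    "Bob Hope            Broadway       United States   101\n"
    "Richard Donner      Park Avenue    United States    76\n"
    "Oscar Meyer         Friedrichshain Germany          47\n"
)

# widths must match the file's actual column widths
df = pd.read_fwf(StringIO(my_text), widths=[20, 15, 16, 10])
print(df)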
The pandas.read_fwf method is what you are looking for.
df = pd.read_fwf( 'data.txt' )
data.txt
name            address         country        age
Bob Hope        Broadway        United States  101
Richard Donner  Park Avenue     United States   76
Oscar Meyer     Friedrichshain  Germany         47
df

id  name            address         country        age
0   Bob Hope        Broadway        United States  101
1   Richard Donner  Park Avenue     United States   76
2   Oscar Meyer     Friedrichshain  Germany         47
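As for the original question of storing the string indices alongside the column names: a minimal sketch, assuming the df with the single 'cols' column from the question, that keeps the slice bounds in one dict and builds everything in a single comprehension:
# column name -> (start, stop) slice bounds inside each fixed-width row
col_specs = {'name': (0, 20), 'address': (21, 36), 'country': (36, 52), 'age': (53, None)}

col_dict = {col: [row[start:stop].strip() for row in df.cols]
            for col, (start, stop) in col_specs.items()}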

How can I add duplicated rows to a Pandas DataFrame?

I appreciate the help in advance!
The question may seem weird at first so let me illustrate what I am trying to accomplish:
I have this df of cities and abbreviations:
I need to add another column called 'Queries' and those queries are on a list as follows:
queries = ['Document Management','Document Imaging','Imaging Services']
The trick though is that I need to duplicate my df rows for each query in the list. For instance, for row 0 I have PHOENIX, AZ. I now need 3 rows saying PHOENIX, AZ, 'query[n]'.
Something that would look like this:
Of course I created that manually but I need to scale it for a large number of cities and a large list of queries.
This sounds simple, but I've been trying for some hours now and I don't see how to engineer any code for it. Again, thanks for the help!
Here is one way, using .explode():
import pandas as pd
df = pd.DataFrame({'City_Name': ['Phoenix', 'Tucson', 'Mesa', 'Los Angeles'],
                   'State': ['AZ', 'AZ', 'AZ', 'CA']})
# 'Query' is a column of tuples
df['Query'] = [('Doc Mgmt', 'Imaging', 'Services')] * len(df.index)
# ... and explode 'unpacks' the tuples, putting one item on each line
df = df.explode('Query')
print(df)
City_Name State Query
0 Phoenix AZ Doc Mgmt
0 Phoenix AZ Imaging
0 Phoenix AZ Services
1 Tucson AZ Doc Mgmt
1 Tucson AZ Imaging
1 Tucson AZ Services
2 Mesa AZ Doc Mgmt
2 Mesa AZ Imaging
2 Mesa AZ Services
3 Los Angeles CA Doc Mgmt
3 Los Angeles CA Imaging
3 Los Angeles CA Services
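A slightly shorter variant of the same idea (a sketch, assuming the queries list from the question and pandas >= 1.1 for ignore_index) assigns the list itself and lets explode unpack it:
# give every row the full list of queries, then unpack one query per row
df = df.assign(Query=[queries] * len(df)).explode('Query', ignore_index=True)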
You should definitely go with jsmart's answer, but posting this as an exercise.
This can also be achieved by exporting the original cities/towns dataframe (df) to a list of records, manually duplicating each one for each query, then reconstructing the final dataframe.
The entire thing can fit in a single line, and is even relatively readable if you can follow what's going on ;)
pd.DataFrame([{**record, 'query': query}
              for query in queries
              for record in df.to_dict(orient='records')])
I'm new to Python myself, but I would get around it by creating n (n = number of unique query values) identical data frames without "Query". Then, for each data frame, create a new column with one of the "Query" values. Finally, stack all the data frames together using append. A short example:
adf1 = pd.DataFrame([['city1', 'state1'], ['city2', 'state2']])
adf2 = adf1.copy()  # use copy(): adf2 = adf1 would make both names point to the same frame, and the second query would overwrite the first
adf1['query'] = 'doc management'
adf2['query'] = 'doc imaging'
df = adf1.append(adf2)
Another method, if there are many types of queries: create a dummy column, say 'key', in both the original data frame and the query data frame, and merge the two on 'key'.
adf = pd.DataFrame([['city1','state1'],['city2','state2']])
q = pd.DataFrame([['doc management'],['doc imaging']])
adf['key'] = 'key'
q['key'] = 'key'
df = pd.merge(adf, q, on='key', how='outer')
More advanced users should have better ways. This is a temporary solution if you are in a hurry.
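On newer pandas (>= 1.2) the dummy 'key' column is unnecessary, because merge supports cross joins directly; a sketch against the same adf and q, before the 'key' columns are added:
# Cartesian product of the two frames; no dummy key needed (pandas >= 1.2)
df = pd.merge(adf, q, how='cross')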

How to group data in a DataFrame and also show the number of rows in each group?

First of all, I have no background in computer languages and I am learning Python.
I'm trying to group some data in a data frame.
[dataframe "cafe_df_merged"]
Actually, I want to create a new data frame that shows the 'city_number', 'city' (which is a name), and also the number of cafes in the same city. So, it should have 3 columns: 'city_number', 'city' and 'number_of_cafe'.
However, I tried to use groupby but the result did not come out as I expected.
city_directory = cafe_df_merged[['city_number', 'city']]
city_directory = city_directory.groupby('city').count()
city_directory
[the result]
How should I do this? Please help, thanks.
There are likely other ways of doing this as well, but something like this should work:
import pandas as pd
import numpy as np
# Create a reproducible example
places = [[['starbucks', 'new_york', '1234']] * 5,
          [['bean_dream', 'boston', '3456']] * 4,
          [['coffee_today', 'jersey', '7643']] * 3,
          [['coffee_today', 'DC', '8902']] * 3,
          [['starbucks', 'nowwhere', '2674']] * 2]
places = [p for sub in places for p in sub]
# a dataframe containing all information
city_directory = pd.DataFrame(places, columns=['shop','city', 'id'])
# make a new dataframe with just cities and ids
# drop duplicate rows
city_info = city_directory.loc[:, ['city','id']].drop_duplicates()
# get the cafe counts (number of cafes)
cafe_count = city_directory.groupby('city').count().iloc[:,0]
# add the cafe counts to the dataframe
city_info['cafe_count'] = cafe_count[city_info['city']].to_numpy()
# reset the index
city_info = city_info.reset_index(drop=True)
city_info now yields the following:
city id cafe_count
0 new_york 1234 5
1 boston 3456 4
2 jersey 7643 3
3 DC 8902 3
4 nowwhere 2674 2
And part of the example dataframe, city_directory.tail(), looks like this:
shop city id
12 coffee_today DC 8902
13 coffee_today DC 8902
14 coffee_today DC 8902
15 starbucks nowwhere 2674
16 starbucks nowwhere 2674
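A more compact route to the same counts (a sketch, reusing the city_directory frame built above) is to let groupby count the rows directly:
# one row per (city, id) pair, with the group size as the cafe count
city_info = (city_directory.groupby(['city', 'id'], sort=False)
             .size()
             .reset_index(name='cafe_count'))
print(city_info)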
Opinion: As a side note, it might be easier to get comfortable with regular Python first before diving deep into the world of pandas and numpy. Otherwise, it might be a bit overwhelming.

Rearrange CSV data

I have 2 csv files with different sequences of columns. E.g., the first file starts with the 10-digit mobile-number column, while that column is at position 4 in the second file.
I need to merge all the customer data into a single csv file. The order of the columns should be as follows:
mobile pincode model Name Address Location pincode date
mobile Name Address Model Location pincode Date
9845299999 Raj Shah nagar No 22 Rivi Building 7Th Main I Crz Mumbai 17/02/2011
9880877777 Managing Partner M/S Aitas # 1010, 124Th Main, Bk Stage. - Bmw 320 D Hyderabad 560070 30-Dec-11
Name Address Location mobile pincode Date Model
Asvi Developers pvt Ltd fantry Road Nariman Point, 1St Floor, No. 150 Chennai 9844066666 13/11/2011 Crz
L R Shiva Gaikwad & Sudha Gaikwad # 42, Suvarna Mansion, 1St Cross, 17Th Main, Banjara Hill, B S K Stage,- Bangalore 9844233333 560085 40859 Mercedes_E 350 Cdi
The second task, which may be slightly more difficult, is that new files may arrive with a totally different column sequence. In that case I need to extract the 10-digit mobile number and 6-digit pincode columns. I need to write code that will guess the city column if it matches any of the entries in a given city list. The new files are expected to have relevant column headings, but the headings may be slightly different, e.g. "customer address" instead of "address". How do I handle such data?
sed 's/.*\([0-9]\{10\}\).*/\1,&/' input
It was suggested that I use sed to move the 10-digit column to the beginning, but I also need to rearrange the text columns. E.g., if a column matches the entries in the following list then it is undoubtedly the model column.
['Crz', 'Bmw 320 D', 'Benz', 'Mercedes_E 350 Cdi', 'Toyota_Corolla He 1.8']
If at least 10% of a column's entries match the above list, then it is the "model" column and should be at position 3, following mobile and pincode.
For your first question, I suggest using pandas to load both files and then concat. After that you can rearrange your columns.
import pandas as pd
dataframe1 = pd.read_csv('file1.csv')
dataframe2 = pd.read_csv('file2.csv')
combined = pd.concat([dataframe1, dataframe2]) #the columns will be ordered alphabetically
To get the desired order:
result_df = combined[['mobile', 'pincode', 'model', 'Name', 'Address', 'Location', 'pincode', 'date']]
and then result_df.to_csv('output.csv', index=False) to export to a csv file.
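As an aside, for the "slightly different column headings" the question mentions (e.g. "customer address" instead of "address"), you could normalise the headers before concatenating. The alias map below is purely illustrative:
# illustrative aliases; extend with whatever headings the new files actually use
aliases = {'customer address': 'Address', 'addr': 'Address', 'mobile no': 'mobile'}
dataframe2 = dataframe2.rename(columns=lambda c: aliases.get(c.strip().lower(), c))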
For the second one, you can do something like this (assuming you have loaded a csv file into df like above)
match_model = lambda m: m in ['Crz', 'Bmw 320 D', 'Benz', 'Mercedes_E 350 Cdi', 'Toyota_Corolla He 1.8']
for c in df:
    if df[c].map(match_model).sum() / len(df) > 0.1:
        print("Column %s is 'Model'" % c)
        df.rename(columns={c: 'Model'}, inplace=True)
You can modify the matching function match_model to use regex instead if you want.
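Along the same lines, the 10-digit mobile and 6-digit pincode columns can be detected with a regex instead of a membership list. A sketch (the detect_column helper and the 0.9 threshold are my own assumptions, and .str.fullmatch needs pandas >= 1.1):
# hypothetical helper: a column qualifies when most of its values match the pattern
def detect_column(series, pattern, threshold=0.9):
    return series.astype(str).str.fullmatch(pattern).mean() >= threshold

for c in list(df.columns):
    if detect_column(df[c], r'\d{10}'):
        df.rename(columns={c: 'mobile'}, inplace=True)
    elif detect_column(df[c], r'\d{6}'):
        df.rename(columns={c: 'pincode'}, inplace=True)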
