Pandas dataframes are an effective tool in the Python programming language for handling and manipulating structured data. Moreover, when data is distributed throughout several dataframes and they all share a common column that acts as a key or identifier, merging the dataframes becomes essential. This process allows you to produce a single dataset that offers a comprehensive picture of the data. The purpose of joining two pandas dataframes on a common column in Python is to consolidate and combine related information from different data sources. In this article, we will discuss different methods that can be used to join two pandas dataframe on a common column in python.
If you want to learn more about Python Programming, visit Python Programming Tutorials.
Methods To Join Two Pandas Dataframe
Python offers a variety of methods for integrating dataframes, providing you the option to select the strategy that most closely matches your objectives for data processing. In this section, we will discuss the following methods in detail:
- Join two pandas dataframe using the merge method
- Join Two Common Columns Using Concatenate Method
- using the join method for merging dataframes
To achieve this, we can use the merge()
function. The merge function joins a dataframe by using a specified row as the key and an additional column as the value. Additionally, for more complex scenarios involving dataframes with corresponding column types, the concatenate
method proves to be a valuable. Another effective method is the join() method which provides further options for dataframe integration based on common columns. Lets discuss these three methods in detail along with examples.
1) Join two Pandas Dataframe using the merge method
One effective method to accomplish this task is to use the merge function. This function allows you to combine the two dataframes into one unified structure. In this way, the shared column is populated with corresponding values from both dataframes.
Syntax: Dataframe.merge(right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
Following table gives a brief explanation of all the parameters:
Parameter | Default Value | Description |
right | – | It specifies the dataframes that you want to merge. |
how | inner | It determines the type of join – whether it’s an inner, outer, left, or right join. |
on | None | It’s the common column name where the two DataFrames will be joined. Both DataFrames must have this column. |
left_on | False | This is the column or multiple columns in the left DataFrame used for joining. |
right_on | False | It specifies the column used for joining from the right DataFrame. It can also be an array of multiple columns. |
left_index | False | If set to True , the function will use the index of the left DataFrame for the join operation. This means that instead of specifying a column to join on, the index values of the left DataFrame will be used. |
right_index | False | Similarly, if set to True , the function will use the index of the right DataFrame as the basis for the join. |
sort | False | It determines whether the values used to combine two DataFrames will be arranged in alphabetical order. If you set it to True , the combined data will be sorted. If you set it to False , the order will depend on the specific type of join you’re performing. |
suffixes | ‘_x’, ‘_y’ | It is a way to label or tag columns with similar names to differentiate them when merging dataframes. |
copy | True | If set to False , the function avoids copying data if possible. |
Indicator | False | If you set it to True , it creates a new column in the merged DataFrame called “_merge”. This column tells you where each row comes from. |
validate | None | It ensures that the merging process meets certain conditions. |
The output of a merge function is a new dataframe which consists of merged values. It does not modify the original dataframes, but create a new dataframe.
Merging Pandas DataFrames Based on Common ID Columns
We can use merge method to create a new combined DataFrame that retains data from both original DataFrames but only for the rows where the IDs match. This process helps consolidate information from different sources while ensuring that only relevant data is included in the merged result. In this example code, we’re using the pd.merge()
function to perform this operation and generate a new DataFrame that includes details from both DataFrame 1 and DataFrame 2, where the IDs are the same.
import pandas as pd
# Create DataFrame 1
data1 = {'ID': [1, 2, 3],
'Name': ['John', 'Mary', 'Alice'],
'Age': [25, 30, 28]}
df1 = pd.DataFrame(data1)
# Create DataFrame 2
data2 = {'ID': [1, 2, 4],
'City': ['New York', 'London', 'Berlin'],
'Salary': [50000, 60000, 45000]}
df2 = pd.DataFrame(data2)
df3=pd.merge(df1, df2)
print("\nDataFrames:")
print(df3)
Output:
DataFrames:
ID Name Age City Salary
0 1 John 25 New York 50000
1 2 Mary 30 London 60000
Using the ‘how’ Parameter for Customized DataFrame Merging
When merging multiple Dataframes, you can alter how the data is combined based on the relationship between the common columns by using the ‘how’ parameter. This parameter specifies the kind of join to be used, such as an inner, outer, left, or right join. Depending on which “how” value you choose, you can decide whether to include only the rows that match, include all rows from both DataFrames, or prioritize one DataFrame’s information over the other.
As discussed above in the table, the how parameter describes the type of join operation to be performed between the dataframes. Now, let’s see how changing the value of the how parameter affects the output.
Consider an example in which we have two pandas DataFrames representing student records and course enrollments. The two dataframes have a common column, “Student_ID,” which can be used for merging and analyzing the data.
import pandas as pd
data1 = {'Student_ID': [1, 2, 3, 4, 5],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
'Age': [18, 19, 20, 18, 21],
'Grade': ['A', 'B', 'C', 'B', 'A']}
df1 = pd.DataFrame(data1)
data2 = {'Student_ID': [2, 4, 6, 1, 3],
'Course': ['Math', 'Physics', 'History', 'Chemistry', 'Biology'],
'Credits': [3, 4, 3, 4, 3]}
df2 = pd.DataFrame(data2)
Now, we will merge these two dataframes and see how the value of ‘how’ parameter affects the output.
Outer Join
All rows from both dataframes are included, and are aligned based on the specified common column(s).
#Outer Join
outer_merging = pd.merge(df1, df2, how="outer")
print(outer_merging)
Output:
Student_ID Name Age Grade Course Credits
0 1 Alice 18.0 A Chemistry 4.0
1 2 Bob 19.0 B Math 3.0
2 3 Charlie 20.0 C Biology 3.0
3 4 David 18.0 B Physics 4.0
4 5 Emma 21.0 A NaN NaN
5 6 NaN NaN NaN History 3.0
If any data is missing in the common column(s), NaN values are used to fill those gaps in the merged DataFrame.
Inner Join
In inner join, only the rows with matching values in the specified common column(s) will be included in the resulting merged DataFrame.
#Inner Join
inner_merging = pd.merge(df1, df2, how="inner")
print(inner_merging)
Output:
Student_ID Name Age Grade Course Credits
0 1 Alice 18 A Chemistry 4
1 2 Bob 19 B Math 3
2 3 Charlie 20 C Biology 3
3 4 David 18 B Physics 4
Right Join
The right join retains all rows from the right DataFrame and combines them with matching rows from the left DataFrame. If there’s no match in the left DataFrame, it fills the resulting DataFrame with NaN or specified fill values.
#Right Join
right_merging = pd.merge(df1, df2, how="right")
print(right_merging)
Output:
Student_ID Name Age Grade Course Credits
0 2 Bob 19.0 B Math 3
1 4 David 18.0 B Physics 4
2 6 NaN NaN NaN History 3
3 1 Alice 18.0 A Chemistry 4
4 3 Charlie 20.0 C Biology 3
Left Join
This retains all rows from the left DataFrame and combines them with matching rows from the right DataFrame. If there’s no match in the right DataFrame, it fills the resulting DataFrame with NaN or specified fill values.
#Left Join
left_merging = pd.merge(df1, df2, how="left")
print(left_merging)
Output:
Student_ID Name Age Grade Course Credits
0 1 Alice 18 A Chemistry 4.0
1 2 Bob 19 B Math 3.0
2 3 Charlie 20 C Biology 3.0
3 4 David 18 B Physics 4.0
4 5 Emma 21 A NaN NaN
Using the ‘on’ parameter to specify column for merging dataframes
When merging two DataFrames, the ‘on’ parameter defines which column(s) will be used as the key to match and merge the two DataFrames. The data from both DataFrames are then combined by aligning rows based on the values in the selected column(s). Lets understand it through an example.
In this example, two Pandas DataFrames, ‘flower’ and ‘test’, are created. The ‘flower’ DataFrame contains information about different types of flowers and their corresponding test attributes. The ‘test’ DataFrame includes flower names and their associated clusters. The goal is to merge these two DataFrames based on the common column ‘flower’.
The pd.merge()
function is used to perform the merging operation. The parameters used in the merge function are as follows:
flower
andtest
: The DataFrames being merged.on='flower'
: Specifies that the merging operation should be based on the ‘flower’ column, which is the common key between the DataFrames.how='outer'
: Indicates that an outer join should be performed.
The resulting merged DataFrame ‘analysis’ contains information from both original DataFrames, where rows are aligned based on the ‘flower’ column. The code is given below:
import pandas as pd
# Create the first DataFrame 'flower'
flower = pd.DataFrame({
'flower': ['Red Ginger', 'Tree Poppy', 'passion flower', 'water lily'],
'test': ['similarities', 'accuracy', 'correctness', 'classification']
}, index=[0, 1, 2, 3])
# Create the second DataFrame 'test'
test = pd.DataFrame({
'flower': ['Red Ginger', 'Tree Poppy', 'rose flower', 'sun flower'],
'cluster': ['cluster_1', 'cluster_2', 'cluster_3', 'cluster_4']
}, index=[4, 5, 6, 7])
# Merge the two DataFrames using the 'flower' column as the key
analysis = pd.merge(flower, test, on='flower', how='outer')
# Display the merged DataFrame 'analysis'
print(analysis)
Output:
flower test cluster
0 Red Ginger similarities cluster_1
1 Tree Poppy accuracy cluster_2
2 passion flower correctness NaN
3 water lily classification NaN
4 rose flower NaN cluster_3
5 sun flower NaN cluster_4
The main function of pd.merge() is to combine DataFrames based on common columns or keys. It’s particularly useful when you want to merge data from different sources while aligning them based on specific columns.
2) Join two common columns using the concatenate method
The pd.concat()
method in pandas is used to concatenate multiple DataFrames along a specified axis. It allows you to combine DataFrames vertically (stacking) or horizontally (side-by-side) based on their indexes or columns. This method is helpful when you want to merge data from multiple sources or create a larger DataFrame by combining smaller ones.
The pd.concat() has an axis parameter which specify the axis along which the concatenation occurs (0 for rows, 1 for columns). By default, the axis parameter is set to 0. Lets see through an example how pd.concat() function works.
import pandas as pd
# Create DataFrame 1: Employee Data
data_employee = {'EmployeeID': [101, 102, 103],
'Name': ['John', 'Alice', 'Bob'],
'Position': ['Manager', 'Engineer', 'Analyst']}
df_employee = pd.DataFrame(data_employee)
# Create DataFrame 2: Employee Projects
data_projects = {
'Project': ['Website Redesign', 'Financial Analysis', 'Product Development'],
'StartYear': [2022, 2023, 2022],
'Status': ['Completed', 'Ongoing', 'Pending']}
df_projects = pd.DataFrame(data_projects)
# Concatenate the two DataFrames horizontally
concatenated_df = pd.concat([df_employee, df_projects], axis=1)
print(concatenated_df)
Output:
EmployeeID Name Position Project StartYear Status
0 101 John Manager Website Redesign 2022 Completed
1 102 Alice Engineer Financial Analysis 2023 Ongoing
2 103 Bob Analyst Product Development 2022 Pending
In this example, pd.concat()
is used to horizontally concatenate the two DataFrames (df_employee
and df_projects
) along the columns. The resulting DataFrame, concatenated_df
, combines employee information and project details, creating a more comprehensive dataset for analysis and reporting. The axis=1
parameter indicates that concatenation should be performed along the columns.
You can also specify additional options like ignore_index
to reset the index of the resulting DataFrame or keys
to create a hierarchical index. Here’s an example of how you can use the pd.concat()
method to combine two DataFrames vertically (along rows):
import pandas as pd
# Create two DataFrames
data1 = {'Name': ['John', 'Mary', 'Alice'],
'Age': [25, 30, 28]}
df1 = pd.DataFrame(data1)
data2 = {'Name': ['Bob', 'Emma', 'David'],
'Age': [22, 27, 29]}
df2 = pd.DataFrame(data2)
# Concatenate DataFrames vertically
combined_df = pd.concat([df1, df2])
print("Combined DataFrame:")
print(combined_df)
Output:
Combined DataFrame:
Name Age
0 John 25
1 Mary 30
2 Alice 28
0 Bob 22
1 Emma 27
2 David 29
From the above example, you can see that the indices are repeated in the concatenated DataFrame. Each row retains its original index value from the source DataFrames, resulting in repeated index values in the concatenated DataFrame. This behavior might not be desirable in cases where a consistent and unique index is preferred for the concatenated DataFrame. To address this, you can set ignore_index
to True
, which will reset the index and provide a new index sequence to the concatenated DataFrame, ensuring index consistency and uniqueness. Here’s the code:
# Concatenate DataFrames vertically
combined_df = pd.concat([df1, df2], ignore_index=True)
print("Combined DataFrame:")
print(combined_df)
Output:
Combined DataFrame:
Name Age
0 John 25
1 Mary 30
2 Alice 28
3 Bob 22
4 Emma 27
5 David 29
The concatenation method is particularly useful when you want to combine datasets from multiple sources, create comprehensive reports, or conduct analysis on diverse datasets.
3) Using Join() Method To Concatenate A Common Column In Python
The join()
method allows you to combine data from different DataFrames by aligning rows with matching values in the specified column. It simplifies the process of merging data and can be especially useful when working with related datasets that share a common identifier. The following example demonstrates how you can use join() function:
import pandas as pd
col_1 = pd.read_csv("ICC MEN'S CRICKET WORLD CUP HIGH SCORES.csv")
col_2 = pd.read_csv("ICC MEN'S HIGH SCORES.csv")
print ('col_1:', '\n', col_1)
print('\n ')
print ('col_2:', '\n', col_2)
Output:
col_1:
Player Runs 4s 6s Unnamed: 4 Team Opposition
0 Fakhar Zaman 193 155 10 NaN Pakistan v South Africa
1 JN Malan 177* 169 16 NaN South Africa v Ireland
2 JC Buttler 162* 70 14 NaN England v Netherlands
3 Ibrahim Zadran 162 138 4 NaN Afghanistan v Sri Lanka
4 Babar Azam 158 139 4 NaN Pakistan v England
col_2:
Player Runs Balls
0 Fakhar Zaman 193 155
1 JN Malan 177* 169
2 JC Buttler 162* 70
3 Ibrahim Zadran 162 138
4 Babar Azam 158 139
result = col_2.join(col_1,lsuffix="_x")
result
Output:
Conclusion
In the context of this tutorial, we have explored how to merge two pandas dataframes based on a shared column using the merge() function. This approach enables the combination of data from separate dataframes by utilizing a specified column as a key and an accompanying column as a value. For more complex scenarios involving dataframes with corresponding column types, the concat() method proves to be an essential tool. It’s worth noting that while the versatile concat() method can also combine dataframes by columns, the merge() function stands as the recommended and optimal choice for joining dataframes on a common column in Python.