Performing Analysis of Meteorological Data

Meteorological Data Analysis

by Sayantan Bhattacharyya

Language used: Python

Libraries used: NumPy, Pandas, Matplotlib, Seaborn

Overview:

In this project, we test a hypothesis about a weather dataset and check whether the data supports it. Along the way we apply data cleaning, resampling, and data visualization.

The Null Hypothesis, H0, is "Do the Apparent temperature and humidity, compared monthly across 10 years of the data, indicate an increase due to Global warming?"

H0 means we need to find out whether the average Apparent temperature for a given month, say April, has increased from 2006 to 2016, and whether the average humidity for the same period has increased.

Monthly analysis has to be done for all 12 months over the 10 year period.

So basically, we have to resample our data from hourly to monthly and then compare the same month over the 10 year period. We then support our analysis with appropriate visualizations using the matplotlib and seaborn libraries.
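Before touching the real data, here is a minimal sketch of the idea on a made-up hourly series. The series, its values, and the names `hourly`, `monthly` and `april` are assumptions for illustration only. The sketch uses the month-start alias 'MS' for the bin labels; the monthly averages themselves do not depend on how the bins are labelled.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: two years of hourly readings
# where every hour of a month holds the value year + month / 100.
idx = pd.date_range("2006-01-01", "2007-12-31 23:00", freq="h", tz="UTC")
values = np.asarray(idx.year + idx.month / 100, dtype=float)
hourly = pd.Series(values, index=idx, name="value")

# Hourly -> monthly means. 'MS' labels each bin by the first day of the month.
monthly = hourly.resample("MS").mean()

# Pick the same calendar month (April) from every year for comparison.
april = monthly[monthly.index.month == 4]
print(april)
```

Each row of `april` is one year's April average, which is the comparison the hypothesis asks for.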

Step 1:

Import the required libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
import seaborn as sbn

Step 2:

Reading the data

df = pd.read_csv("weatherHistory.csv")
df

Step 3:

Describing the data

df.describe()

Checking for null values

df.isnull().sum()

We see that the column 'Precip Type' has 517 null values. Now we check the non-null count and data type of each column.

df.info()

Step 4:

Data Cleaning

# Removing the rows with null values
# .copy() gives an independent frame, so adding columns later does not warn
new_df = df.dropna().copy()
new_df

Checking the info of the data frame after removing the null values.

new_df.info()

new_df.describe()
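Since the analysis only uses 'Apparent Temperature (C)' and 'Humidity', one alternative is to drop a row only when one of those two columns is missing, so that a missing 'Precip Type' does not cost us a temperature reading. This is a sketch on a made-up frame (the values and the name `toy` are assumptions), not what the rest of this post does:

```python
import numpy as np
import pandas as pd

# Made-up frame, not the real data: one row is missing 'Precip Type',
# another is missing 'Humidity'.
toy = pd.DataFrame({
    "Precip Type": ["rain", None, "snow", "rain"],
    "Apparent Temperature (C)": [7.4, 9.1, -1.2, 5.0],
    "Humidity": [0.89, 0.86, np.nan, 0.80],
})

# Null count per column, as in Step 3.
print(toy.isnull().sum())

# Drop a row if ANY column is null (what Step 4 does).
all_dropped = toy.dropna()

# Drop a row only if one of the columns we analyse is null.
subset_dropped = toy.dropna(subset=["Apparent Temperature (C)", "Humidity"])

print(len(all_dropped), len(subset_dropped))
```

The second variant keeps the row whose only gap is 'Precip Type'.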

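Step 5:

Resampling the data (pre-processing)

# Parse the timestamps as UTC and use them as the index
new_df['Formatted Date'] = pd.to_datetime(new_df['Formatted Date'], utc=True)
new_df = new_df.set_index('Formatted Date')

# Resample according to month end ('M'; pandas 2.2 and later spell this 'ME').
# Only numeric columns can be averaged, so select them first.
resampled_df = new_df.select_dtypes(include='number').resample('M').mean()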
resampled_df

Step 6:

Plotting a graph of Humidity 2006-2016

plt.figure(figsize = (15,8))
hum_plot = sbn.lineplot(y = resampled_df['Humidity'], x = resampled_df.index,data = resampled_df)
hum_plot.set_xlabel("Year", fontsize = 15)
hum_plot.set_ylabel("Humidity", fontsize = 15)
plt.title("Humidity plot [2006-2016]")

Step 7:

Plotting a graph of Apparent Temperature 2006-2016

plt.figure(figsize = (15,8))
temp_plot = sbn.lineplot(y = resampled_df['Apparent Temperature (C)'], x = resampled_df.index,data = resampled_df)
temp_plot.set_xlabel("Year", fontsize = 15)
temp_plot.set_ylabel("Apparent Temperature (C)", fontsize = 15)
plt.title("Apparent Temperature plot [2006-2016]")

Step 8:

Plotting a graph of Temperature Vs. Apparent Temperature 2006-2016

plt.figure(figsize = (15,8))
# label= ties each legend entry to its own line
tem_plot = sbn.lineplot(y = resampled_df['Temperature (C)'], x = resampled_df.index,data = resampled_df,color ='blue', label = "Temperature (C)")
tem_plot = sbn.lineplot(y = resampled_df['Apparent Temperature (C)'], x = resampled_df.index,data = resampled_df, color ='green', label = "Apparent Temperature (C)")
plt.legend()
tem_plot.set_xlabel("Year", fontsize = 15)
tem_plot.set_ylabel("Temperature (C)", fontsize = 15)
plt.title("Temperature Vs. Apparent Temperature [2006-2016]")

Step 9:

Plotting monthly graphs of Average Apparent Temperature 2006-2016

new_df['month'] = new_df.index.month
new_df['year'] = new_df.index.year
avg_data_tempreature_monthly = {}
for year in range(2006,2017):
    for month in range(1,13):
        result = list(new_df.loc[(new_df['month'] == month)&(new_df['year']==year) , :]['Apparent Temperature (C)'].values)
        if month not in avg_data_tempreature_monthly:
            avg_data_tempreature_monthly[month] = [np.mean(result)]
        else:
            avg_data_tempreature_monthly[month].append(np.mean(result))
TM = pd.DataFrame(avg_data_tempreature_monthly)
TM['year'] = range(2006,2017)
title = {1:'Jan',2:'Feb',3:'March',4:'April',5:'May',6:'June',7:'July',8:'Aug',9:'Sep',
         10:'Oct',11:'Nov',12:'Dec'}
for month in range(1,13):
    sbn.barplot(x = TM['year'] , y = TM[month])
    
    plt.title('Bar plot for Month :' + title[month])
    plt.ylabel("Mean Apparent Temperature(C)")
    plt.xlabel("Year")
    plt.show()

Step 10:

Plotting monthly graphs of Average Humidity 2006-2016

avg_data_humidity_monthly = {}
for year in range(2006,2017):
    for month in range(1,13):
        result = list(new_df.loc[(new_df['month'] == month)&(new_df['year']==year) , :]['Humidity'].values)
        if month not in avg_data_humidity_monthly:
            avg_data_humidity_monthly[month] = [np.mean(result)]
        else:
            avg_data_humidity_monthly[month].append(np.mean(result))

HM = pd.DataFrame(avg_data_humidity_monthly)
HM['year'] = range(2006,2017)

for month in range(1,13):
    sbn.barplot(x = HM['year'] , y = HM[month])
    
    plt.title('Bar plot for Month :' + title[month])
    plt.ylabel("Average Humidity")
    plt.xlabel("Year")
    plt.show()
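The nested loops in Steps 9 and 10 build a year-by-month table of averages by hand. As a rough sketch, the same table can be built with a groupby on the index. The frame below is made up (only the column name 'Humidity' comes from the dataset; `toy` and `table` are names chosen for this example):

```python
import numpy as np
import pandas as pd

# Hypothetical hourly frame covering three years; the values are invented.
idx = pd.date_range("2006-01-01", "2008-12-31 23:00", freq="h", tz="UTC")
toy = pd.DataFrame({"Humidity": np.linspace(0.5, 0.9, len(idx))}, index=idx)

# Group by (year, month), average, then move the month level into columns.
table = (
    toy["Humidity"]
    .groupby([toy.index.year, toy.index.month])
    .mean()
    .unstack()
)
table.index.name = "year"
table.columns.name = "month"

print(table.round(3))
```

Here `table[4]` is the April column across all years, playing the same role as `HM[4]` above.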

Step 11:

Plotting monthly graphs of Apparent Temperature(C) Vs. Humidity 2006-2016

for month in range(1,13):
    plt.plot(range(2006,2017),avg_data_tempreature_monthly[month] , label = 'Apparent Temperature(C)' , color = 'red')
    plt.plot(range(2006,2017),avg_data_humidity_monthly[month] , label = 'Humidity')
    plt.legend()
    plt.title('Apparent Temperature Vs. Humidity for Month : '+ title[month])
    plt.show()

Conclusion:

From the visualizations, we can see that the monthly average humidity is nearly the same from 2006 to 2016, but this is not the case with the monthly average apparent temperature over the same period. So we conclude that global warming is affecting the apparent temperature and not the humidity.
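The conclusion above rests on reading the plots by eye. One rough way to put a number on it is to fit a straight line to a month's yearly averages and look at the slope. The sketch below uses made-up values and a hypothetical helper `yearly_slope`; with the real data, a column of TM or HM would take the place of `rising` or `flat`. A slope on its own is not a formal hypothesis test, only a summary of the trend.

```python
import numpy as np

years = np.arange(2006, 2017)

# Made-up yearly averages for one month (not taken from the dataset):
# one series rising by 0.1 per year and one that stays flat.
rising = 10.0 + 0.1 * (years - 2006)
flat = np.full(len(years), 0.75)

def yearly_slope(values):
    """Least-squares slope of values against year, in units per year."""
    # Centre the years so the fit is numerically well behaved.
    x = years - years.mean()
    slope, _intercept = np.polyfit(x, values, 1)
    return slope

print(yearly_slope(rising))
print(yearly_slope(flat))
```

A slope close to zero matches what we called "nearly the same" for humidity, while a clearly non-zero slope would point to a trend.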

GitHub link:

https://github.com/sayantan-bhattacharyya/Performing-Analysis-of-Meteorological-Data

I am thankful to the mentors at https://internship.suvenconsultants.com for providing awesome problem statements and giving many of us a Coding Internship Experience. Thank you www.suvenconsultants.com
