I’m in the process of developing a pipeline that runs in Google Cloud and uses Cloud Scheduler for execution at specified intervals. My setup involves accessing an SFTP server to retrieve two XLS files using one module, processing those files into JSON format with another module, and finally sending a PATCH request to Airtable to refresh specific records with a third module. Initially, I considered organizing these functions in separate files, but I opted to consolidate them into a single main.py file, which also includes a requirements.txt for dependencies.
The main entry point of the Cloud Function is defined as follows:
import os
import pysftp
import pandas as pd
import requests
import json
def fetch_data_for_analysis(ftp_directory, file_names):
'''
Connects to an SFTP server, retrieves specified XLS files, and reads them into dataframes.
'''
HOST = os.environ.get('SFTP_HOST')
USER = os.environ.get('SFTP_USER')
PASSWORD = os.environ.get('SFTP_PASS')
PORT = int(os.environ.get('SFTP_PORT'))
data_frames = []
options = pysftp.CnOpts()
options.hostkeys = None
with pysftp.Connection(HOST, username=USER, password=PASSWORD, port=PORT, cnopts=options) as sftp:
print('Connected to SFTP')
files = sftp.listdir()
for file_name in file_names:
with sftp.open(ftp_directory + file_name) as file:
df = pd.read_excel(file)
data_frames.append(df)
return data_frames
However, I face the following error:
500 Internal Server Error: The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.
In the logs, I see this snippet:
File "/layers/google.python.pip/pip/lib/python3.11/site-packages/flask/app.py", line 2073, in wsgi_app
response = self.full_dispatch_request()