Patch IC Spider #230

Merged on May 16, 2024 (5 commits)
118 changes: 100 additions & 18 deletions Dockerfile
@@ -1,26 +1,108 @@
-FROM --platform=linux/amd64 ubuntu:20.04
+FROM --platform=linux/amd64 fedora:41

-# Set timezone for tzdata
-ENV TZ=UTC
+## running as root
+USER root

-# Install tzdata non-interactively
-RUN ln -fs /usr/share/zoneinfo/$TZ /etc/localtime && \
-apt-get update && \
-apt-get install -y tzdata
+## shell for RUN cmd purposes
+SHELL ["/bin/bash", "-c"]

-# Update and install necessary packages
-RUN apt-get update && apt-get upgrade -y ca-certificates && \
-apt-get install -y curl unzip xvfb libxi6 libgconf-2-4 wget sudo git libxml2-dev libxslt1-dev
+#####
+## ## SYS Package Setup
+#####

-# Install Google Chrome
-RUN wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb --no-check-certificate && \
-apt -y install ./google-chrome-stable_current_amd64.deb && \
-rm google-chrome-stable_current_amd64.deb
+# LOCALE (important for python, etc.)
+RUN dnf -y install glibc-locale-source glibc-langpack-en
+RUN localedef -i en_US -f UTF-8 en_US.UTF-8

-# Install ChromeDriver
-RUN wget https://storage.googleapis.com/chrome-for-testing-public/123.0.6312.105/linux64/chromedriver-linux64.zip --no-check-certificate && \
-unzip chromedriver-linux64.zip -d /usr/local/bin/ && \
-rm chromedriver-linux64.zip
+ENV LANG="en_US.UTF-8"
+ENV LANGUAGE="en_US.UTF-8"
+ENV LC_CTYPE="en_US.UTF-8"
+ENV LC_NUMERIC="en_US.UTF-8"
+ENV LC_TIME="en_US.UTF-8"
+ENV LC_COLLATE="en_US.UTF-8"
+ENV LC_MONETARY="en_US.UTF-8"
+ENV LC_MESSAGES="en_US.UTF-8"
+ENV LC_PAPER="en_US.UTF-8"
+ENV LC_NAME="en_US.UTF-8"
+ENV LC_ADDRESS="en_US.UTF-8"
+ENV LC_TELEPHONE="en_US.UTF-8"
+ENV LC_MEASUREMENT="en_US.UTF-8"
+ENV LC_IDENTIFICATION="en_US.UTF-8"
+ENV LC_ALL="en_US.UTF-8"
+
+# Python3 and Env Prereqs
+RUN dnf update -y \
+&& dnf install -y \
+autoconf \
+automake \
+binutils \
+bison \
+flex \
+gcc \
+gcc-c++ \
+gettext \
+libtool \
+make \
+patch \
+pkgconfig \
+redhat-rpm-config \
+rpm-build \
+rpm-sign \
+byacc \
+cscope \
+ctags \
+diffstat \
+doxygen \
+elfutils \
+gcc-gfortran \
+git \
+indent \
+intltool \
+patchutils \
+rcs \
+subversion \
+swig \
+systemtap \
+libxml2 \
+libxslt \
+&& dnf install -y \
+wget \
+python3.x86_64 \
+python3-devel.x86_64 \
+python3-pip.noarch \
+bzip2 \
+glibc.i686 \
+zip \
+unzip \
+&& dnf clean all \
+&& rm -rf /var/cache/dnf
+
+# Update base python setup packages (avoids
+RUN pip3 install --no-cache-dir --upgrade pip wheel setuptools
+
+#####
+## ## Chrome & ChromeDriver Setup
+#####
+
+## Installing AWS CLI
+RUN curl -LfSo /tmp/awscliv2.zip "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" \
+&& unzip -q /tmp/awscliv2.zip -d /opt \
+&& /opt/aws/install
+
+## Getting chrome browser
+RUN wget --no-check-certificate https://dl.google.com/linux/chrome/rpm/stable/x86_64/google-chrome-stable-120.0.6099.109-1.x86_64.rpm -P /tmp/ \
+&& dnf install /tmp/google-chrome-stable-120.0.6099.109-1.x86_64.rpm -y \
+&& rm /tmp/google-chrome-stable-120.0.6099.109-1.x86_64.rpm
+
+## Getting chrome driver
+RUN wget --no-check-certificate -O /tmp/chromedriver.zip https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/120.0.6099.109/linux64/chromedriver-linux64.zip \
+&& unzip /tmp/chromedriver.zip chromedriver-linux64/chromedriver -d /tmp/ \
+&& mv /tmp/chromedriver-linux64/chromedriver /usr/local/bin/ \
+&& rm -rf /tmp/chromedriver*
+
+#####
+## ## Python packages
+#####

 # Install Miniconda
 RUN wget https://repo.anaconda.com/miniconda/Miniconda3-py310_22.11.1-1-Linux-x86_64.sh && \
5 changes: 4 additions & 1 deletion dataPipelines/gc_scrapy/gc_scrapy/doc_item_fields.py
@@ -30,7 +30,10 @@ def __init__(
             self.display_doc_type = doc_type
         else:
             self.display_doc_type = display_doc_type
-        self.publication_date = publication_date.strftime("%Y-%m-%dT%H:%M:%S")
+        try:
+            self.publication_date = publication_date.strftime("%Y-%m-%dT%H:%M:%S")
+        except AttributeError:
+            self.publication_date = None
         self.cac_login_required = cac_login_required
         self.source_page_url = source_page_url
         self.downloadable_items = downloadable_items
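
Note on the hunk above: the new try/except means a missing publication date no longer raises inside `__init__`. The following is a minimal standalone sketch of that behavior, not code from this PR. The helper name `format_publication_date` is hypothetical and exists only for illustration; the format string is the one used in the diff.

```python
from datetime import datetime
from typing import Optional


def format_publication_date(publication_date: Optional[datetime]) -> Optional[str]:
    """Hypothetical helper mirroring the try/except added in doc_item_fields.py."""
    try:
        # Same format string as in the diff.
        return publication_date.strftime("%Y-%m-%dT%H:%M:%S")
    except AttributeError:
        # None, or any other object without a strftime method, lands here.
        return None


if __name__ == "__main__":
    print(format_publication_date(datetime(2024, 5, 16, 9, 30)))
    print(format_publication_date(None))
```

One consequence worth keeping in mind: any value without a `strftime` method, such as a date passed as a plain string, is also mapped to None rather than raising, so downstream consumers should presumably be prepared for a null `publication_date`.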