Platform: AWS EC2 (DLAMI) ✅

This approach worked for me a few days ago, but I had to start over from scratch and now the kernel is dying as soon as I create a ConvLearner in notebook 1.

These instructions seem to be working: https://github.com/krishnakalyan3/course-v3/blob/aed64af19b34bcf0ddf1263bfd7d0e1744aac884/docs/start_aws.md

That worked for me. I couldn’t get any other AWS EC2 setup working.

I created a p2.xlarge instance and then ran the commands.

‘sudo apt-get install cuda’ starts unpacking the CUDA files, but then I get errors like the following:

Unpacking cuda-libraries-10-0 (10.0.130-1) …
dpkg: unrecoverable fatal error, aborting:
unable to flush /var/lib/dpkg/updates/tmp.i after padding: No space left on device
E: Sub-process /usr/bin/dpkg returned an error code (2)

ubuntu@ip-10-0-0-66:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

It looks like the DLAMI is having a lot of problems. I installed everything as per the instructions, but somehow the fastai library was not installed properly, and I keep getting errors such as untar_data being undefined. I followed the FAQ guide to uninstall them, but with no success. I am moving to the easier crestle.ai option.

The installation of the libraries failed because the hard drive (EBS volume) is too small (the default is 8GB).

When setting up you should change the size (under volumes in the “Review” page) to something bigger to handle the libraries and the data (I used 75GB).

Now you should be able to resize the volume (but you might need to repartition the hard drive) or add a new volume, but if you’re not too attached to the instance it might be easier to start over.
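Before rerunning the install, it may be worth checking how much space is actually free on the root volume. Here is a minimal Python sketch using only the standard library. The function names and the 20 GiB threshold are illustrative assumptions, not figures from this thread:

```python
import shutil


def free_gib(path="/"):
    """Return the free space, in GiB, on the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / 2**30


def has_room(path="/", needed_gib=20.0):
    """Return True if at least `needed_gib` GiB are free.

    The 20 GiB default is an assumed threshold for illustration only.
    """
    return free_gib(path) >= needed_gib


print(f"Free space on /: {free_gib('/'):.1f} GiB")
print("Enough room for the install?", has_room("/"))
```

If this reports only a few GiB free on a default 8GB volume, the “No space left on device” error above is the expected result.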

The course uses the Ubuntu AMI. I noticed there is also an Amazon Linux AMI, and I can run lesson 1 on it with no problem.

Is there any difference between the Ubuntu AMI and Amazon Linux AMI?

Any suggestions?

Just bumping the question above.

Thanks :grinning:

I am getting an error when running conda update conda. The error is pasted below:

ubuntu@ip-172-31-30-171:~$ conda update conda

CorruptedEnvironmentError: The target environment has been corrupted. Corrupted environments most commonly
occur when the conda process is force-terminated while in an unlink-link
transaction.
environment location: /home/ubuntu/anaconda3
corrupted file: /home/ubuntu/anaconda3/conda-meta/fastai-1.0.50.post1-1.json

Is there anything that can be done to solve this error?

I followed the current instructions for AWS exactly and tried both DLAMI version 16 and the current one (23.1). When I get to the step to run jupyter notebook, I get the following error in the SSH-connected terminal:

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/bin/jupyter", line 7, in <module>
    from jupyter_core.command import main
ModuleNotFoundError: No module named 'jupyter_core'

I tried googling this for a few hours but to no avail. Any help would be much appreciated.

@davidpfahler Was this fixed? I am in the same situation. I tried conda install -c anaconda jupyter_core, but it still fails due to environment inconsistency.

I still have the same issue. I will post if I get it to work.

Seems like I got it to work. Before conda install -c pytorch -c fastai fastai pytorch torchvision cuda92 I did the following:

conda init bash
source ~/.bashrc
conda activate pytorch_p36

That seems to do the trick. I am working in a Jupyter notebook right now.
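To confirm that the shell (or the notebook kernel) really ended up in the intended environment, a quick check along these lines may help. This is only a sketch: the function name env_report is made up for illustration, and the module list is an assumption based on the packages mentioned in this thread (torch is the import name of the pytorch package):

```python
import importlib.util
import sys


def env_report(modules=("jupyter_core", "fastai", "torch")):
    """Report the active interpreter prefix and which modules it can find.

    `modules` is an assumed list based on the packages named in this thread.
    """
    report = {"prefix": sys.prefix}
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        report[name] = importlib.util.find_spec(name) is not None
    return report


for key, value in env_report().items():
    print(f"{key}: {value}")
```

After conda activate pytorch_p36, the prefix would be expected to end in envs/pytorch_p36. If it still points at the base anaconda3 directory, the activation did not take effect in that shell.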

Thanks. That worked!

Let me first say that I am a newbie using anaconda and EC2-instances.

I ran into what I figure is a similar problem as the one you describe. Also tried your solution (conda init bash etc) which at least allowed me to fire up Jupyter. But my AWS-instance is still littered with problems which didn’t exist two weeks ago.

The major issue seems to be some sort of environment inconsistencies. I am however a noob, so I have no idea if that’s the case. At the moment I am trying to install the relevant packages for unpacking the tar files in Lesson 3’s Kaggle satellite images.

This input (run in the notebook):

! conda install --yes --prefix {sys.prefix} -c haasad eidl7zip

crashes and produces the following error message:

Collecting package metadata (current_repodata.json): done
Solving environment: done
WARNING conda.core.package_cache_data:_make_single_record(350): Encountered corrupt package tarball at /home/ubuntu/anaconda3/pkgs/_libgcc_mutex-0.1-main.conda. Conda has removed it, but you need to re-run conda to download it again.
WARNING conda.core.package_cache_data:_make_single_record(350): Encountered corrupt package tarball at /home/ubuntu/anaconda3/pkgs/ca-certificates-2019.5.15-0.conda. Conda has removed it, but you need to re-run conda to download it again.
WARNING conda.core.package_cache_data:_make_single_record(350): Encountered corrupt package tarball at /home/ubuntu/anaconda3/pkgs/certifi-2019.6.16-py36_0.conda. Conda has removed it, but you need to re-run conda to download it again.
WARNING conda.core.package_cache_data:_make_single_record(350): Encountered corrupt package tarball at /home/ubuntu/anaconda3/pkgs/openssl-1.0.2s-h7b6447c_0.conda. Conda has removed it, but you need to re-run conda to download it again.

Package Plan

environment location: /home/ubuntu/anaconda3/envs/pytorch_p36

added / updated specs:
- eidl7zip

The following packages will be downloaded:

package                    |            build
---------------------------|-----------------
_libgcc_mutex-0.1          |             main           3 KB
ca-certificates-2019.5.15  |                0         126 KB
certifi-2019.6.16          |           py36_0         150 KB
openssl-1.0.2s             |       h7b6447c_0         2.1 MB
------------------------------------------------------------
                                       Total:         2.4 MB

The following NEW packages will be INSTALLED:

_libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main
eidl7zip haasad/linux-64::eidl7zip-1.0.0-1

The following packages will be UPDATED:

ca-certificates 2019.1.23-0 --> 2019.5.15-0
certifi 2019.3.9-py36_0 --> 2019.6.16-py36_0
openssl 1.0.2r-h7b6447c_0 --> 1.0.2s-h7b6447c_0

Downloading and Extracting Packages
ca-certificates-2019 | 126 KB | ##################################### | 100%
certifi-2019.6.16 | 150 KB | ##################################### | 100%
_libgcc_mutex-0.1 | 3 KB | ##################################### | 100%
openssl-1.0.2s | 2.1 MB | ##################################### | 100%

InvalidArchiveError(‘Error with archive /home/ubuntu/anaconda3/pkgs/ca-certificates-2019.5.15-0.conda. You probably need to delete and re-download or re-create this file. Message from libarchive was:\n\nChild process exited with status 127 (errno=-1, retcode=-30, archive_p=94305117464384)’)
InvalidArchiveError(‘Error with archive /home/ubuntu/anaconda3/pkgs/certifi-2019.6.16-py36_0.conda. You probably need to delete and re-download or re-create this file. Message from libarchive was:\n\nChild process exited with status 127 (errno=-1, retcode=-30, archive_p=94305117464384)’)
InvalidArchiveError(‘Error with archive /home/ubuntu/anaconda3/pkgs/_libgcc_mutex-0.1-main.conda. You probably need to delete and re-download or re-create this file. Message from libarchive was:\n\nChild process exited with status 127 (errno=-1, retcode=-30, archive_p=94305117464384)’)
InvalidArchiveError(‘Error with archive /home/ubuntu/anaconda3/pkgs/openssl-1.0.2s-h7b6447c_0.conda. You probably need to delete and re-download or re-create this file. Message from libarchive was:\n\nChild process exited with status 127 (errno=-1, retcode=-30, archive_p=94305117464384)’)


Anyone had similar issues? Any solutions at hand? If it is any clue to the problem, I am also getting channel 3: open failed: connect failed: Connection refused in my instance terminal. ¯_(ツ)_/¯
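The conda messages above say the affected archives need to be deleted and re-downloaded. As a first step, the paths can be pulled out of the pasted output so it is clear which cached files are involved. This is a rough sketch: the regular expression is an assumption matched against the two message formats shown in this post, and the function name is made up for illustration:

```python
import re

# Matches the two message formats seen in the output above (an assumption;
# other conda versions may word these messages differently).
ARCHIVE_RE = re.compile(
    r"(?:corrupt package tarball at|Error with archive)\s+"
    r"(\S+?\.(?:conda|tar\.bz2))(?=[\s.]|$)"
)


def corrupt_archives(log_text):
    """Return the unique archive paths reported as corrupt, in order seen."""
    seen = []
    for path in ARCHIVE_RE.findall(log_text):
        if path not in seen:
            seen.append(path)
    return seen


sample = (
    "WARNING: Encountered corrupt package tarball at "
    "/home/ubuntu/anaconda3/pkgs/openssl-1.0.2s-h7b6447c_0.conda. "
    "Conda has removed it, but you need to re-run conda.\n"
    "InvalidArchiveError('Error with archive "
    "/home/ubuntu/anaconda3/pkgs/openssl-1.0.2s-h7b6447c_0.conda. "
    "You probably need to delete and re-download or re-create this file.')"
)
for path in corrupt_archives(sample):
    print(path)
```

The sketch only lists the files. Whether deleting them and rerunning the install resolves the problem in this case is not confirmed anywhere in this thread.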

Thanks. This works.

Hi Johan,

Welcome to the fastai forums! If you are just getting started with the course or with fastai in general, AWS might not be the best choice. For just following the course, Google Colab has worked well for me and is free!

However, I do not want to discourage you from tackling the issue if you do want to get fastai working on AWS. I don’t think I can help with your exact error. What I would recommend is to delete the instance and start from scratch. Then follow the guide, but use the latest version of the Deep Learning AMI and insert the steps I described above.

Let me know if that still doesn’t get you going.

Thanks for taking your time @davidpfahler,

I have tried creating new instances several times, but with no luck. On my latest try I ran into different issues, however. This time it seemed to be version inconsistencies similar to the ones posted here. The solution presented in that thread (conda remove pykerberos) didn’t solve my issue, unfortunately :confused:

I’m gonna wait a week or two and then try again with a new instance. If the same issues arise, and I still can’t find any solution on the forum, I am going to try a different service than AWS.

Many thanks, again, for your reply!

If you just want to do the course, I’d probably use something other than AWS. There are free alternatives like Google Colab or Azure Notebooks, for example. Sorry, I can’t help you with your AWS problem.

Johan,

I also ran into issues. I originally followed the instructions exactly as written without success, so I began trying various permutations of advice on this forum, ideas from Stack Overflow, and different machine images on AWS.

What ultimately worked for me was:

  1. Start with a new instance. I used the ubuntu 23.1 machine image. (I’d tried the Amazon Linux one too, and oddly, almost everything worked except that launching jupyter notebook failed to start a python kernel.)
  2. conda update conda. Lots of packages get installed and updated, but this proceeds pretty quickly.
  3. As @davidpfahler suggested:
    source ~/.bashrc
    This appears to switch to a virtual environment that the command prompt calls “(base)”
  4. conda activate pytorch_p36.
    This activates the machine instance’s pre-configured pytorch environment
  5. conda install -c pytorch -c fastai fastai pytorch torchvision cuda92.
    This is the step that had failed for me previously. What happens this time around is that I first get a message: The environment is inconsistent, please check the package plan carefully. Lots of packages are listed. The installation then tries with the “next repodata source.” This again throws a message that the environment is inconsistent, and says “failed.” BUT then the script moves on and says: “Initial quick solve with frozen env failed. Unfreezing env and trying again.” This step seems to think for quite a while: 3-4 minutes of what seems like hanging. But at last, the script resumes executing and completes.
  6. From here, jupyter notebook works as expected.
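
Once the notebook starts, a short check like the one below can confirm that the installed PyTorch actually sees the GPU. This is a sketch under the assumption that pytorch was installed by step 5. The function name gpu_status is made up for illustration, while torch.cuda.is_available and torch.cuda.get_device_name are existing PyTorch calls:

```python
def gpu_status():
    """Return a short, human-readable description of PyTorch's GPU visibility."""
    try:
        import torch  # import name of the pytorch package
    except (ImportError, OSError):
        return "torch is not importable in this environment"
    if torch.cuda.is_available():
        return f"CUDA OK: {torch.cuda.get_device_name(0)}"
    return "torch imported, but no CUDA device is visible"


print(gpu_status())
```

If no CUDA device is visible, the problem is more likely with the NVIDIA driver (compare the nvidia-smi failure earlier in this thread) than with the conda environment.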

I’m not sure why the installation appears to need to spool through various repodata.json files. Maybe from within the virtual environment, one could run conda update --all or something?

But apart from that quirk, those steps worked for me.

Hope that helps.
