Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support RFC 5987 for attribute filename* in HTTP header Content-Disposition #4647

Merged
merged 8 commits into from
Jan 14, 2021

Conversation

aserkes
Copy link
Contributor

@aserkes aserkes commented Nov 30, 2020

Add support RFC 5987 for attribute filename* in HTTP header Content-Disposition

@@ -40,6 +41,8 @@
private Date modificationDate;
private Date readDate;
private long size;
private static final Pattern FILENAME_EXT_VALUE_PATTERN =
Pattern.compile("(UTF-8|ISO-8859-1)'([a-z]{2,8}(-[a-z0-9]+)?)?'(%[a-f0-9]{2}|[a-z0-9!#$&+.^_`|~-])+");
Copy link
Member

@jbescos jbescos Dec 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not expert on RFC 5987 but for example this (ISO-8859-1 in lower case) will not match:
iso-8859-1'language-us'abc%a1abc%a2%b1!#$&+.^_`|~-

It would be good to pick many examples of valid strings for testing.

Copy link
Contributor

@senivam senivam Dec 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of the PR should be to allow multi-language (non ASCII) file names. And, for example, this pattern fileNameExt = "UTF-8''українська_назва.pdf"; does not work. Same applies for the pattern: fileNameExt = "UTF-8''nombre_español.pdf";. And what is the most disapointing here, it does not allow capital letters (even ASCII) in the name.
@aserkes, could you please fix this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RFC 5987 does not allow any symbols except in ASCII encoding. Symbols in any other encoding have to be encoded as described in https://tools.ietf.org/html/rfc3986#section-2.1

Copy link
Contributor

@senivam senivam Dec 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so the pattern for multi-language file name shall look like fileNameExt = "UTF-8''nombre_espa%c3%b1ol.pdf"
And what about language-tag? The RFC 5987 references Section 2.3 of [RFC2978] but I'm not sure it can include the word language itself. It might be that the whole [ language ] construct in the ext-value description shall be substituted by language code from RFC 2978. Or do I understand this wrong?
And yes, it is legitimate to indicate encoding character insensitive, so comment and fix about this is fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Language-Tag can exist or can be omitted. Requirements to this tag are described in https://tools.ietf.org/html/rfc5646#section-2.1 . This tag can have various content that is not always standardized and can expand, but main pattern for it looks like ([a-z]{2,8}(-[a-z0-9]+)?)? (if I understand RFC 5646 correctly).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the pattern seems to be relevant,

and another thing comes to mind - it might be good to extend existing constructor (along with related builders) with the fileNameExt parameter. Just not to take fileNameExt from header and to have the whole ContentDisposition class consistent.

@jansupol what do you think?

Copy link
Contributor

@jansupol jansupol Dec 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RFC 5987 describes:

  1. The character set the recipient is about to support. Mandatory is UTF-8, ISO-88592-1 is optional as per RFC 8187, and other charsets are possible. However, Jersey does not support the charsets per se, the PR only checks the EBNF. From this point of view, I am not sure it makes sense to prohibit other charsets.
  2. File-name value characters. Jersey encodes data such as URL, query params,... for the users. This PR does not encode anything, however. Since the encoding is left on the user, I am not sure whether the code should restrict filename* values for characters that do not match the regular expression. The encoding is left on the users. Should they use a non-encoded symbol it is they who would deal with it at the end. The corresponding bug specifically asks for being able to add non-encoded symbols.

For these two reasons, should we just have the following?:

   private void createParameters() throws ParseException {
        fileName = parameters.get("filename");

        String fileNameExt = parameters.get("filename*");
        if (fileNameExt != null) {
            fileName = fileNameExt;
        }
        ....

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ideal behaviour would be:

  1. Check whether the file name is encoded. If not, check the char set. Only UTF-8 needs to be supported per RFC 8187, and only UTF-8 is supported by Jersey internal econding utility UriComponent#encode. If the not encoded and not UTF-8, throw exception. If UTF-8, and non encoded, encode.
  2. If encoded, we do not care about the char set. UTF-8, ISO-8859-2, or any other is fine. Just check for valid characters, if not valid, throw exception. If valid check language.
  3. The language should be checked by the regular expression you have in PR. If not valid, throw exception.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made some changes :

  1. Added encoding for a filename* parameter if it is valid, not encoded, and its charset is UTF-8.
  2. Jersey will throw an exception if a filename* parameter is not valid.
    At the same time I left only 2 possible charset for a filename* parameter (ISO-8859-1 and UTF-8) because as mentioned in RFC 5987 :

Producers MUST use either the "UTF-8" ([RFC3629]) or the "ISO-8859-1"
([ISO-8859-1]) character set.

@senivam senivam linked an issue Dec 5, 2020 that may be closed by this pull request
Signed-off-by: Andrii Serkes <[email protected]>
…if a filename parameter is not valid

Signed-off-by: aserkes <[email protected]>
if (matcher.group(CHARSET_GROUP_NAME).equalsIgnoreCase("UTF-8")) {
return new StringBuilder(matcher.group(CHARSET_GROUP_NAME))
.append("'")
.append(matcher.group(LANG_GROUP_NAME) == null ? "" : matcher.group(LANG_GROUP_NAME))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regexps are possibly not the fastest piece of code. Can you store the result of matcher.group(LANG_GROUP_NAME) and reuse?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Signed-off-by: aserkes <[email protected]>
@senivam
Copy link
Contributor

senivam commented Jan 11, 2021

I would advise to bump copyright year to 2021 since those files where modified in the current year already.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support contendisposition.filename* (RFC 5987)
4 participants